Posted to reviews@spark.apache.org by liancheng <gi...@git.apache.org> on 2016/06/10 06:03:36 UTC

[GitHub] spark pull request #13592: [SPARK-15863][SQL][DOC] Initial SQL programming g...

GitHub user liancheng opened a pull request:

    https://github.com/apache/spark/pull/13592

    [SPARK-15863][SQL][DOC] Initial SQL programming guide update for Spark 2.0

    ## What changes were proposed in this pull request?
    
    Initial SQL programming guide update for Spark 2.0. Some content, such as the 1.6-to-2.0 migration guide, is still incomplete.
    
    ## How was this patch tested?
    
    N/A

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/liancheng/spark sql-programming-guide-2.0

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/13592.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #13592
    
----
commit 92f3f11563934f5d3dd6233663ba3b77fe8bbc67
Author: Cheng Lian <li...@databricks.com>
Date:   2016-06-10T06:02:04Z

    Initial SQL programming guide update for Spark 2.0

----




[GitHub] spark pull request #13592: [SPARK-15863][SQL][DOC] Initial SQL programming g...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13592#discussion_r67746893
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -12,130 +12,129 @@ title: Spark SQL and DataFrames
     Spark SQL is a Spark module for structured data processing. Unlike the basic Spark RDD API, the interfaces provided
     by Spark SQL provide Spark with more information about the structure of both the data and the computation being performed. Internally,
     Spark SQL uses this extra information to perform extra optimizations. There are several ways to
    -interact with Spark SQL including SQL, the DataFrames API and the Datasets API. When computing a result
    +interact with Spark SQL including SQL and the Dataset API. When computing a result
     the same execution engine is used, independent of which API/language you are using to express the
    -computation. This unification means that developers can easily switch back and forth between the
    -various APIs based on which provides the most natural way to express a given transformation.
    +computation. This unification means that developers can easily switch back and forth between
    +different APIs based on which provides the most natural way to express a given transformation.
     
     All of the examples on this page use sample data included in the Spark distribution and can be run in
     the `spark-shell`, `pyspark` shell, or `sparkR` shell.
     
     ## SQL
     
    -One use of Spark SQL is to execute SQL queries written using either a basic SQL syntax or HiveQL.
    +One use of Spark SQL is to execute SQL queries.
     Spark SQL can also be used to read data from an existing Hive installation. For more on how to
     configure this feature, please refer to the [Hive Tables](#hive-tables) section. When running
    -SQL from within another programming language the results will be returned as a [DataFrame](#DataFrames).
    +SQL from within another programming language the results will be returned as a [DataFrame](#datasets-and-dataframes).
     You can also interact with the SQL interface using the [command-line](#running-the-spark-sql-cli)
     or over [JDBC/ODBC](#running-the-thrift-jdbcodbc-server).
     
    -## DataFrames
    +## Datasets and DataFrames
     
    -A DataFrame is a distributed collection of data organized into named columns. It is conceptually
    -equivalent to a table in a relational database or a data frame in R/Python, but with richer
    -optimizations under the hood. DataFrames can be constructed from a wide array of [sources](#data-sources) such
    -as: structured data files, tables in Hive, external databases, or existing RDDs.
    +A Dataset is a new interface added in Spark 1.6 that tries to provide the benefits of RDDs (strong
    +typing, ability to use powerful lambda functions) with the benefits of Spark SQL's optimized
    +execution engine. A Dataset can be [constructed](#creating-datasets) from JVM objects and then
    +manipulated using functional transformations (`map`, `flatMap`, `filter`, etc.).
     
    -The DataFrame API is available in [Scala](api/scala/index.html#org.apache.spark.sql.DataFrame),
    -[Java](api/java/index.html?org/apache/spark/sql/DataFrame.html),
    -[Python](api/python/pyspark.sql.html#pyspark.sql.DataFrame), and [R](api/R/index.html).
    +The Dataset API is the successor of the DataFrame API, which was introduced in Spark 1.3. In Spark
    +2.0, Datasets and DataFrames are unified, and DataFrames are now equivalent to Datasets of `Row`s.
    +In fact, `DataFrame` is simply a type alias of `Dataset[Row]` in [the Scala API][scala-datasets].
    +However, [Java API][java-datasets] users must use `Dataset<Row>` instead.
     
    -## Datasets
    +[scala-datasets]: api/scala/index.html#org.apache.spark.sql.Dataset
    +[java-datasets]: api/java/index.html?org/apache/spark/sql/Dataset.html
     
    -A Dataset is a new experimental interface added in Spark 1.6 that tries to provide the benefits of
    -RDDs (strong typing, ability to use powerful lambda functions) with the benefits of Spark SQL's
    -optimized execution engine. A Dataset can be [constructed](#creating-datasets) from JVM objects and then manipulated
    -using functional transformations (map, flatMap, filter, etc.).
    +Python does not have support for the Dataset API, but due to its dynamic nature many of the
    +benefits are already available (i.e. you can access the field of a row by name naturally
    +`row.columnName`). The case for R is similar.
     
    -The unified Dataset API can be used both in [Scala](api/scala/index.html#org.apache.spark.sql.Dataset) and
    -[Java](api/java/index.html?org/apache/spark/sql/Dataset.html). Python does not yet have support for
    -the Dataset API, but due to its dynamic nature many of the benefits are already available (i.e. you can
    -access the field of a row by name naturally `row.columnName`). Full python support will be added
    -in a future release.
    +Throughout this document, we will often refer to Scala/Java Datasets of `Row`s as DataFrames.
     
     # Getting Started
     
    -## Starting Point: SQLContext
    +## Starting Point: SparkSession
     
     <div class="codetabs">
     <div data-lang="scala"  markdown="1">
     
    -The entry point into all functionality in Spark SQL is the
    -[`SQLContext`](api/scala/index.html#org.apache.spark.sql.SQLContext) class, or one of its
    -descendants. To create a basic `SQLContext`, all you need is a SparkContext.
    +The entry point into all functionality in Spark is the [`SparkSession`](api/scala/index.html#org.apache.spark.sql.SparkSession) class. To create a basic `SparkSession`, just use `SparkSession.build()`:
     
     {% highlight scala %}
    -val sc: SparkContext // An existing SparkContext.
    -val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    +import org.apache.spark.sql.SparkSession
    +
    +val spark = SparkSession.build()
    +  .master("local")
    +  .appName("Word Count")
    +  .config("spark.some.config.option", "some-value")
    +  .getOrCreate()
     
     // this is used to implicitly convert an RDD to a DataFrame.
    -import sqlContext.implicits._
    +import spark.implicits._
     {% endhighlight %}
     
     </div>
     
     <div data-lang="java" markdown="1">
     
    -The entry point into all functionality in Spark SQL is the
    -[`SQLContext`](api/java/index.html#org.apache.spark.sql.SQLContext) class, or one of its
    -descendants. To create a basic `SQLContext`, all you need is a SparkContext.
    +The entry point into all functionality in Spark is the [`SparkSession`](api/java/index.html#org.apache.spark.sql.SparkSession) class. To create a basic `SparkSession`, just use `SparkSession.build()`:
     
     {% highlight java %}
    -JavaSparkContext sc = ...; // An existing JavaSparkContext.
    -SQLContext sqlContext = new org.apache.spark.sql.SQLContext(sc);
    -{% endhighlight %}
    +import org.apache.spark.sql.SparkSession
     
    +SparkSession spark = SparkSession.build()
    +  .master("local")
    +  .appName("Word Count")
    +  .config("spark.some.config.option", "some-value")
    +  .getOrCreate();
    +{% endhighlight %}
     </div>
     
     <div data-lang="python"  markdown="1">
     
    -The entry point into all relational functionality in Spark is the
    -[`SQLContext`](api/python/pyspark.sql.html#pyspark.sql.SQLContext) class, or one
    -of its decedents. To create a basic `SQLContext`, all you need is a SparkContext.
    +The entry point into all functionality in Spark is the [`SparkSession`](api/python/pyspark.sql.html#pyspark.sql.SparkSession) class. To create a basic `SparkSession`, just use `SparkSession.build`:
     
     {% highlight python %}
    -from pyspark.sql import SQLContext
    -sqlContext = SQLContext(sc)
    +from pyspark.sql import SparkSession
    +
    +spark = SparkSession.build \
    +  .master("local") \
    +  .appName("Word Count") \
    +  .config("spark.some.config.option", "some-value") \
    +  .getOrCreate()
     {% endhighlight %}
     
     </div>
     
     <div data-lang="r"  markdown="1">
     
    -The entry point into all relational functionality in Spark is the
    -`SQLContext` class, or one of its decedents. To create a basic `SQLContext`, all you need is a SparkContext.
    +Unlike Scala, Java, and Python API, we haven't finished migrating `SQLContext` to `SparkSession` for SparkR yet, so
    +the entry point into all relational functionality in SparkR is still the
    +`SQLContext` class in Spark 2.0. To create a basic `SQLContext`, all you need is a `SparkContext`.
     
     {% highlight r %}
    -sqlContext <- sparkRSQL.init(sc)
    +spark <- sparkRSQL.init(sc)
    --- End diff --
    
    ditto here




[GitHub] spark pull request #13592: [SPARK-15863][SQL][DOC] Initial SQL programming g...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13592#discussion_r67203777
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -1142,11 +1141,11 @@ write.parquet(schemaPeople, "people.parquet")
     
     # Read in the Parquet file created above. Parquet files are self-describing so the schema is preserved.
     # The result of loading a parquet file is also a DataFrame.
    -parquetFile <- read.parquet(sqlContext, "people.parquet")
    +parquetFile <- read.parquet(spark, "people.parquet")
     
     # Parquet files can also be used to create a temporary view and then used in SQL statements.
     registerTempTable(parquetFile, "parquetFile")
    --- End diff --
    
    ```
    createOrReplaceTempView(parquetFile, "parquetFile")
    ```
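
    For context, a minimal Scala sketch (not part of this PR; the path and view name are examples only) of the Spark 2.0 analogue of this suggestion, where `registerTempTable` is superseded by `createOrReplaceTempView`:
    
    ```
    import org.apache.spark.sql.SparkSession
    
    val spark = SparkSession.builder().appName("ParquetTempView").getOrCreate()
    
    // Read the Parquet file back in; the schema is preserved.
    val parquetFile = spark.read.parquet("people.parquet")
    
    // Register it as a temporary view and query it with SQL.
    parquetFile.createOrReplaceTempView("parquetFile")
    val teenagers = spark.sql("SELECT name FROM parquetFile WHERE age >= 13 AND age <= 19")
    teenagers.show()
    ```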




[GitHub] spark pull request #13592: [SPARK-15863][SQL][DOC] Initial SQL programming g...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13592#discussion_r67543099
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -517,24 +517,26 @@ types such as Sequences or Arrays. This RDD can be implicitly converted to a Dat
     registered as a table. Tables can be used in subsequent SQL statements.
     
     {% highlight scala %}
    -// sc is an existing SparkContext.
    -val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    +val spark: SparkSession // An existing SparkSession
     // this is used to implicitly convert an RDD to a DataFrame.
    -import sqlContext.implicits._
    +import spark.implicits._
     
     // Define the schema using a case class.
     // Note: Case classes in Scala 2.10 can support only up to 22 fields. To work around this limit,
     // you can use custom classes that implement the Product interface.
     case class Person(name: String, age: Int)
     
    -// Create an RDD of Person objects and register it as a table.
    -val people = sc.textFile("examples/src/main/resources/people.txt").map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt)).toDF()
    +// Create an RDD of Person objects and register it as a temporary view.
    +val people = sc
    +  .textFile("examples/src/main/resources/people.txt")
    +  .map(_.split(","))
    +  .map(p => Person(p(0), p(1).trim.toInt))
    +  .toDF()
    --- End diff --
    
    I think it's still fair to say that we are using reflection, since all the de/serializer expressions used in encoders are generated using reflection.
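
    As a rough illustration of that point (a hedged sketch, not taken from the PR), the implicit `Encoder` for a case class is derived by reflecting on its fields, and `toDF()`/`toDS()` then use it to build the schema and the converter expressions:
    
    ```
    import org.apache.spark.sql.SparkSession
    
    val spark = SparkSession.builder().appName("EncoderSketch").getOrCreate()
    import spark.implicits._
    
    // The schema (name: string, age: int) is inferred from the case class fields.
    case class Person(name: String, age: Int)
    
    val people = spark.sparkContext
      .textFile("examples/src/main/resources/people.txt")
      .map(_.split(","))
      .map(p => Person(p(0), p(1).trim.toInt))
      .toDS() // or .toDF() for the untyped view of the same data
    
    people.printSchema()
    ```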




[GitHub] spark pull request #13592: [SPARK-15863][SQL][DOC] Initial SQL programming g...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13592#discussion_r67202307
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -171,9 +171,9 @@ df.show()
     
     <div data-lang="r"  markdown="1">
     {% highlight r %}
    -sqlContext <- SQLContext(sc)
    +spark <- SparkSession(sc)
     
    -df <- read.json(sqlContext, "examples/src/main/resources/people.json")
    +df <- read.json(spark, "examples/src/main/resources/people.json")
    --- End diff --
    
    In `SparkR`, the above is deprecated. We can now use the following instead.
    ```
    df <- read.json("examples/src/main/resources/people.json")
    ```
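
    For comparison, a hedged Scala sketch (not from the PR) of the same pattern on the Scala side of Spark 2.0, where the reader hangs off the session instead of taking it as an argument:
    
    ```
    import org.apache.spark.sql.SparkSession
    
    val spark = SparkSession.builder().appName("ReadJson").getOrCreate()
    
    // The session is implied by `spark.read`; it is no longer passed explicitly.
    val df = spark.read.json("examples/src/main/resources/people.json")
    df.show()
    ```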




[GitHub] spark pull request #13592: [SPARK-15863][SQL][DOC] Initial SQL programming g...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13592#discussion_r66872913
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -12,130 +12,130 @@ title: Spark SQL and DataFrames
     Spark SQL is a Spark module for structured data processing. Unlike the basic Spark RDD API, the interfaces provided
     by Spark SQL provide Spark with more information about the structure of both the data and the computation being performed. Internally,
     Spark SQL uses this extra information to perform extra optimizations. There are several ways to
    -interact with Spark SQL including SQL, the DataFrames API and the Datasets API. When computing a result
    +interact with Spark SQL including SQL and the Datasets API. When computing a result
    --- End diff --
    
    How about `, DataFrame API (Python/R) and Dataset API (Scala/Java)`?




[GitHub] spark issue #13592: [SPARK-15863][SQL][DOC] Initial SQL programming guide up...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13592
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request #13592: [SPARK-15863][SQL][DOC] Initial SQL programming g...

Posted by clockfly <gi...@git.apache.org>.
Github user clockfly commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13592#discussion_r66894292
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -12,130 +12,130 @@ title: Spark SQL and DataFrames
     Spark SQL is a Spark module for structured data processing. Unlike the basic Spark RDD API, the interfaces provided
     by Spark SQL provide Spark with more information about the structure of both the data and the computation being performed. Internally,
     Spark SQL uses this extra information to perform extra optimizations. There are several ways to
    -interact with Spark SQL including SQL, the DataFrames API and the Datasets API. When computing a result
    +interact with Spark SQL including SQL and the Datasets API. When computing a result
     the same execution engine is used, independent of which API/language you are using to express the
    -computation. This unification means that developers can easily switch back and forth between the
    -various APIs based on which provides the most natural way to express a given transformation.
    +computation. This unification means that developers can easily switch back and forth between
    +different APIs based on which provides the most natural way to express a given transformation.
     
     All of the examples on this page use sample data included in the Spark distribution and can be run in
     the `spark-shell`, `pyspark` shell, or `sparkR` shell.
     
     ## SQL
     
    -One use of Spark SQL is to execute SQL queries written using either a basic SQL syntax or HiveQL.
    +One use of Spark SQL is to execute SQL queries.
     Spark SQL can also be used to read data from an existing Hive installation. For more on how to
     configure this feature, please refer to the [Hive Tables](#hive-tables) section. When running
    -SQL from within another programming language the results will be returned as a [DataFrame](#DataFrames).
    +SQL from within another programming language the results will be returned as a [Dataset\[Row\]](#datasets).
     You can also interact with the SQL interface using the [command-line](#running-the-spark-sql-cli)
     or over [JDBC/ODBC](#running-the-thrift-jdbcodbc-server).
     
    -## DataFrames
    +## Datasets and DataFrames
     
    -A DataFrame is a distributed collection of data organized into named columns. It is conceptually
    -equivalent to a table in a relational database or a data frame in R/Python, but with richer
    -optimizations under the hood. DataFrames can be constructed from a wide array of [sources](#data-sources) such
    -as: structured data files, tables in Hive, external databases, or existing RDDs.
    +A Dataset is a new interface added in Spark 1.6 that tries to provide the benefits of RDDs (strong
    +typing, ability to use powerful lambda functions) with the benefits of Spark SQL's optimized
    +execution engine. A Dataset can be [constructed](#creating-datasets) from JVM objects and then
    +manipulated using functional transformations (map, flatMap, filter, etc.).
     
    -The DataFrame API is available in [Scala](api/scala/index.html#org.apache.spark.sql.DataFrame),
    -[Java](api/java/index.html?org/apache/spark/sql/DataFrame.html),
    -[Python](api/python/pyspark.sql.html#pyspark.sql.DataFrame), and [R](api/R/index.html).
    +The Dataset API is the successor of the DataFrame API, which was introduced in Spark 1.3. In Spark
    +2.0, Datasets and DataFrames are unified, and DataFrames are now equivalent to Datasets of `Row`s.
    +In fact, `DataFrame` is simply a type alias of `Dataset[Row]` in [the Scala API][scala-datasets].
    +However, [Java API][java-datasets] users must use `Dataset<Row>` instead.
     
    -## Datasets
    +[scala-datasets]: api/scala/index.html#org.apache.spark.sql.Dataset
    +[java-datasets]: api/java/index.html?org/apache/spark/sql/Dataset.html
     
    -A Dataset is a new experimental interface added in Spark 1.6 that tries to provide the benefits of
    -RDDs (strong typing, ability to use powerful lambda functions) with the benefits of Spark SQL's
    -optimized execution engine. A Dataset can be [constructed](#creating-datasets) from JVM objects and then manipulated
    -using functional transformations (map, flatMap, filter, etc.).
    +Python does not have support for the Dataset API, but due to its dynamic nature many of the
    +benefits are already available (i.e. you can access the field of a row by name naturally
    +`row.columnName`). The case for R is similar.
     
    -The unified Dataset API can be used both in [Scala](api/scala/index.html#org.apache.spark.sql.Dataset) and
    -[Java](api/java/index.html?org/apache/spark/sql/Dataset.html). Python does not yet have support for
    -the Dataset API, but due to its dynamic nature many of the benefits are already available (i.e. you can
    -access the field of a row by name naturally `row.columnName`). Full python support will be added
    -in a future release.
    +Throughout this document, we will often refer to Scala/Java Datasets of `Row`s as DataFrames.
     
     # Getting Started
     
    -## Starting Point: SQLContext
    +## Starting Point: SparkSession
     
     <div class="codetabs">
     <div data-lang="scala"  markdown="1">
     
    -The entry point into all functionality in Spark SQL is the
    -[`SQLContext`](api/scala/index.html#org.apache.spark.sql.SQLContext) class, or one of its
    -descendants. To create a basic `SQLContext`, all you need is a SparkContext.
    +The entry point into all functionality in Spark SQL is the [`SparkSession`](api/scala/index.html#org.apache.spark.sql.SparkSession) class. To create a basic `SparkSession`, just use `SparkSession.builder()`:
     
     {% highlight scala %}
    -val sc: SparkContext // An existing SparkContext.
    -val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    +import org.apache.spark.sql.SparkSession
    +
    +val spark = SparkSession.build()
    +  .master("local")
    +  .appName("Word Count")
    +  .config("spark.some.config.option", "some-value")
    +  .getOrCreate()
     
     // this is used to implicitly convert an RDD to a DataFrame.
    -import sqlContext.implicits._
    +import spark.implicits._
     {% endhighlight %}
     
     </div>
     
     <div data-lang="java" markdown="1">
     
    -The entry point into all functionality in Spark SQL is the
    -[`SQLContext`](api/java/index.html#org.apache.spark.sql.SQLContext) class, or one of its
    -descendants. To create a basic `SQLContext`, all you need is a SparkContext.
    +The entry point into all functionality in Spark SQL is the [`SparkSession`](api/java/index.html#org.apache.spark.sql.SparkSession) class. To create a basic `SparkSession`, just use `SparkSession.build()`:
     
     {% highlight java %}
    -JavaSparkContext sc = ...; // An existing JavaSparkContext.
    -SQLContext sqlContext = new org.apache.spark.sql.SQLContext(sc);
    -{% endhighlight %}
    +import org.apache.spark.sql.SparkSession
     
    +SparkSession spark = SparkSession.build()
    +  .master("local")
    +  .appName("Word Count")
    +  .config("spark.some.config.option", "some-value")
    +  .getOrCreate();
    +{% endhighlight %}
     </div>
     
     <div data-lang="python"  markdown="1">
     
    -The entry point into all relational functionality in Spark is the
    -[`SQLContext`](api/python/pyspark.sql.html#pyspark.sql.SQLContext) class, or one
    -of its decedents. To create a basic `SQLContext`, all you need is a SparkContext.
    +The entry point into all relational functionality in Spark is the [`SparkSession`](api/python/pyspark.sql.html#pyspark.sql.SparkSession) class. To create a basic `SparkSession`, just use `SparkSession.build`:
     
     {% highlight python %}
    -from pyspark.sql import SQLContext
    -sqlContext = SQLContext(sc)
    +from pyspark.sql import SparkSession
    +spark = SparkSession.build \
    +  .master("local") \
    +  .appName("Word Count") \
    +  .config("spark.some.config.option", "some-value") \
    +  .getOrCreate()
     {% endhighlight %}
     
     </div>
     
     <div data-lang="r"  markdown="1">
     
     The entry point into all relational functionality in Spark is the
    -`SQLContext` class, or one of its decedents. To create a basic `SQLContext`, all you need is a SparkContext.
    +`SparkSession` class. To create a basic `SparkSession`, all you need is a `SparkContext`.
     
     {% highlight r %}
    -sqlContext <- sparkRSQL.init(sc)
    +spark <- sparkRSQL.init(sc)
     {% endhighlight %}
     
     </div>
     </div>
     
    -In addition to the basic `SQLContext`, you can also create a `HiveContext`, which provides a
    -superset of the functionality provided by the basic `SQLContext`. Additional features include
    -the ability to write queries using the more complete HiveQL parser, access to Hive UDFs, and the
    -ability to read data from Hive tables. To use a `HiveContext`, you do not need to have an
    -existing Hive setup, and all of the data sources available to a `SQLContext` are still available.
    -`HiveContext` is only packaged separately to avoid including all of Hive's dependencies in the default
    -Spark build. If these dependencies are not a problem for your application then using `HiveContext`
    -is recommended for the 1.3 release of Spark. Future releases will focus on bringing `SQLContext` up
    -to feature parity with a `HiveContext`.
    +`SparkSession` in Spark 2.0 provides builtin support for Hive features including the ability to
    +write queries using HiveQL, access to Hive UDFs, and the ability to read data from Hive tables.
    +To use these features, you do not need to have an existing Hive setup, and all of the data sources
    +available to a `SparkSession` are still available. These Hive features are only packaged separately
    +to avoid including all of Hive's dependencies in the default Spark build.
     
     
     ## Creating DataFrames
     
    -With a `SQLContext`, applications can create `DataFrame`s from an <a href='#interoperating-with-rdds'>existing `RDD`</a>, from a Hive table, or from <a href='#data-sources'>data sources</a>.
    +With a `SparkSession`, applications can create DataFrames from an <a href='#interoperating-with-rdds'>existing `RDD`</a>,
    +from a Hive table, or from <a href='#data-sources'>data sources</a>.
    --- End diff --
    
    or from data sources => or from Spark data sources?
    
    To emphasize that this refers to the Spark Data Source API.
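
    As a point of reference, a minimal sketch (not part of the PR; the format and path are examples only) of what "Spark data sources" refers to here, namely the generic `DataFrameReader` API where the format name selects the source:
    
    ```
    import org.apache.spark.sql.SparkSession
    
    val spark = SparkSession.builder().appName("DataSourceExample").getOrCreate()
    
    // Built-in formats such as parquet, json, csv, and jdbc are all Spark data sources.
    val df = spark.read
      .format("parquet")
      .load("examples/src/main/resources/users.parquet")
    
    df.show()
    ```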




[GitHub] spark issue #13592: [SPARK-15863][SQL][DOC] Initial SQL programming guide up...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on the issue:

    https://github.com/apache/spark/pull/13592
  
    @maropu Sorry for the late reply. Yeah, adding descriptions for these two options makes sense. Would you like to open a PR for this? Thanks!




[GitHub] spark issue #13592: [SPARK-15863][SQL][DOC] Initial SQL programming guide up...

Posted by maropu <gi...@git.apache.org>.
Github user maropu commented on the issue:

    https://github.com/apache/spark/pull/13592
  
    @liancheng Is it worth adding the two parameters `spark.sql.files.maxPartitionBytes` and `spark.sql.files.openCostInBytes` to `Other Configuration Options`? They are somewhat internal parameters, but they seem useful for users who would like to control the number of partitions. https://issues.apache.org/jira/browse/SPARK-15894
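
    For illustration, a hedged sketch of how those two options can be set (the option names come from the comment above; the values are examples only):
    
    ```
    import org.apache.spark.sql.SparkSession
    
    val spark = SparkSession.builder()
      .appName("FileSplitTuning")
      // Target size, in bytes, of a single file-based input partition.
      .config("spark.sql.files.maxPartitionBytes", 128L * 1024 * 1024)
      // Estimated cost, in bytes, of opening a file; used when packing small files together.
      .config("spark.sql.files.openCostInBytes", 4L * 1024 * 1024)
      .getOrCreate()
    
    // The same options can also be changed on an existing session:
    spark.conf.set("spark.sql.files.maxPartitionBytes", 64L * 1024 * 1024)
    ```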




[GitHub] spark pull request #13592: [SPARK-15863][SQL][DOC] Initial SQL programming g...

Posted by clockfly <gi...@git.apache.org>.
Github user clockfly commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13592#discussion_r66891847
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -12,130 +12,130 @@ title: Spark SQL and DataFrames
     Spark SQL is a Spark module for structured data processing. Unlike the basic Spark RDD API, the interfaces provided
     by Spark SQL provide Spark with more information about the structure of both the data and the computation being performed. Internally,
     Spark SQL uses this extra information to perform extra optimizations. There are several ways to
    -interact with Spark SQL including SQL, the DataFrames API and the Datasets API. When computing a result
    +interact with Spark SQL including SQL and the Datasets API. When computing a result
     the same execution engine is used, independent of which API/language you are using to express the
    -computation. This unification means that developers can easily switch back and forth between the
    -various APIs based on which provides the most natural way to express a given transformation.
    +computation. This unification means that developers can easily switch back and forth between
    +different APIs based on which provides the most natural way to express a given transformation.
     
     All of the examples on this page use sample data included in the Spark distribution and can be run in
     the `spark-shell`, `pyspark` shell, or `sparkR` shell.
     
     ## SQL
     
    -One use of Spark SQL is to execute SQL queries written using either a basic SQL syntax or HiveQL.
    +One use of Spark SQL is to execute SQL queries.
     Spark SQL can also be used to read data from an existing Hive installation. For more on how to
     configure this feature, please refer to the [Hive Tables](#hive-tables) section. When running
    -SQL from within another programming language the results will be returned as a [DataFrame](#DataFrames).
    +SQL from within another programming language the results will be returned as a [Dataset\[Row\]](#datasets).
     You can also interact with the SQL interface using the [command-line](#running-the-spark-sql-cli)
     or over [JDBC/ODBC](#running-the-thrift-jdbcodbc-server).
     
    -## DataFrames
    +## Datasets and DataFrames
     
    -A DataFrame is a distributed collection of data organized into named columns. It is conceptually
    -equivalent to a table in a relational database or a data frame in R/Python, but with richer
    -optimizations under the hood. DataFrames can be constructed from a wide array of [sources](#data-sources) such
    -as: structured data files, tables in Hive, external databases, or existing RDDs.
    +A Dataset is a new interface added in Spark 1.6 that tries to provide the benefits of RDDs (strong
    +typing, ability to use powerful lambda functions) with the benefits of Spark SQL's optimized
    --- End diff --
    
    "with the benefits of" => "as well as" ?




[GitHub] spark pull request #13592: [SPARK-15863][SQL][DOC] Initial SQL programming g...

Posted by clockfly <gi...@git.apache.org>.
Github user clockfly commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13592#discussion_r66894492
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -12,130 +12,130 @@ title: Spark SQL and DataFrames
     Spark SQL is a Spark module for structured data processing. Unlike the basic Spark RDD API, the interfaces provided
     by Spark SQL provide Spark with more information about the structure of both the data and the computation being performed. Internally,
     Spark SQL uses this extra information to perform extra optimizations. There are several ways to
    -interact with Spark SQL including SQL, the DataFrames API and the Datasets API. When computing a result
    +interact with Spark SQL including SQL and the Datasets API. When computing a result
     the same execution engine is used, independent of which API/language you are using to express the
    -computation. This unification means that developers can easily switch back and forth between the
    -various APIs based on which provides the most natural way to express a given transformation.
    +computation. This unification means that developers can easily switch back and forth between
    +different APIs based on which provides the most natural way to express a given transformation.
     
     All of the examples on this page use sample data included in the Spark distribution and can be run in
     the `spark-shell`, `pyspark` shell, or `sparkR` shell.
     
     ## SQL
     
    -One use of Spark SQL is to execute SQL queries written using either a basic SQL syntax or HiveQL.
    +One use of Spark SQL is to execute SQL queries.
     Spark SQL can also be used to read data from an existing Hive installation. For more on how to
     configure this feature, please refer to the [Hive Tables](#hive-tables) section. When running
    -SQL from within another programming language the results will be returned as a [DataFrame](#DataFrames).
    +SQL from within another programming language the results will be returned as a [Dataset\[Row\]](#datasets).
     You can also interact with the SQL interface using the [command-line](#running-the-spark-sql-cli)
     or over [JDBC/ODBC](#running-the-thrift-jdbcodbc-server).
     
    -## DataFrames
    +## Datasets and DataFrames
     
    -A DataFrame is a distributed collection of data organized into named columns. It is conceptually
    -equivalent to a table in a relational database or a data frame in R/Python, but with richer
    -optimizations under the hood. DataFrames can be constructed from a wide array of [sources](#data-sources) such
    -as: structured data files, tables in Hive, external databases, or existing RDDs.
    +A Dataset is a new interface added in Spark 1.6 that tries to provide the benefits of RDDs (strong
    +typing, ability to use powerful lambda functions) with the benefits of Spark SQL's optimized
    +execution engine. A Dataset can be [constructed](#creating-datasets) from JVM objects and then
    +manipulated using functional transformations (map, flatMap, filter, etc.).
     
    -The DataFrame API is available in [Scala](api/scala/index.html#org.apache.spark.sql.DataFrame),
    -[Java](api/java/index.html?org/apache/spark/sql/DataFrame.html),
    -[Python](api/python/pyspark.sql.html#pyspark.sql.DataFrame), and [R](api/R/index.html).
    +The Dataset API is the successor of the DataFrame API, which was introduced in Spark 1.3. In Spark
    +2.0, Datasets and DataFrames are unified, and DataFrames are now equivalent to Datasets of `Row`s.
    +In fact, `DataFrame` is simply a type alias of `Dataset[Row]` in [the Scala API][scala-datasets].
    +However, [Java API][java-datasets] users must use `Dataset<Row>` instead.
     
    -## Datasets
    +[scala-datasets]: api/scala/index.html#org.apache.spark.sql.Dataset
    +[java-datasets]: api/java/index.html?org/apache/spark/sql/Dataset.html
     
    -A Dataset is a new experimental interface added in Spark 1.6 that tries to provide the benefits of
    -RDDs (strong typing, ability to use powerful lambda functions) with the benefits of Spark SQL's
    -optimized execution engine. A Dataset can be [constructed](#creating-datasets) from JVM objects and then manipulated
    -using functional transformations (map, flatMap, filter, etc.).
    +Python does not have support for the Dataset API, but due to its dynamic nature many of the
    +benefits are already available (i.e. you can access the field of a row by name naturally
    +`row.columnName`). The case for R is similar.
     
    -The unified Dataset API can be used both in [Scala](api/scala/index.html#org.apache.spark.sql.Dataset) and
    -[Java](api/java/index.html?org/apache/spark/sql/Dataset.html). Python does not yet have support for
    -the Dataset API, but due to its dynamic nature many of the benefits are already available (i.e. you can
    -access the field of a row by name naturally `row.columnName`). Full python support will be added
    -in a future release.
    +Throughout this document, we will often refer to Scala/Java Datasets of `Row`s as DataFrames.
     
     # Getting Started
     
    -## Starting Point: SQLContext
    +## Starting Point: SparkSession
     
     <div class="codetabs">
     <div data-lang="scala"  markdown="1">
     
    -The entry point into all functionality in Spark SQL is the
    -[`SQLContext`](api/scala/index.html#org.apache.spark.sql.SQLContext) class, or one of its
    -descendants. To create a basic `SQLContext`, all you need is a SparkContext.
    +The entry point into all functionality in Spark SQL is the [`SparkSession`](api/scala/index.html#org.apache.spark.sql.SparkSession) class. To create a basic `SparkSession`, just use `SparkSession.builder()`:
     
     {% highlight scala %}
    -val sc: SparkContext // An existing SparkContext.
    -val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    +import org.apache.spark.sql.SparkSession
    +
    +val spark = SparkSession.build()
    +  .master("local")
    +  .appName("Word Count")
    +  .config("spark.some.config.option", "some-value")
    +  .getOrCreate()
     
     // this is used to implicitly convert an RDD to a DataFrame.
    -import sqlContext.implicits._
    +import spark.implicits._
     {% endhighlight %}
     
     </div>
     
     <div data-lang="java" markdown="1">
     
    -The entry point into all functionality in Spark SQL is the
    -[`SQLContext`](api/java/index.html#org.apache.spark.sql.SQLContext) class, or one of its
    -descendants. To create a basic `SQLContext`, all you need is a SparkContext.
    +The entry point into all functionality in Spark SQL is the [`SparkSession`](api/java/index.html#org.apache.spark.sql.SparkSession) class. To create a basic `SparkSession`, just use `SparkSession.build()`:
     
     {% highlight java %}
    -JavaSparkContext sc = ...; // An existing JavaSparkContext.
    -SQLContext sqlContext = new org.apache.spark.sql.SQLContext(sc);
    -{% endhighlight %}
    +import org.apache.spark.sql.SparkSession
     
    +SparkSession spark = SparkSession.build()
    +  .master("local")
    +  .appName("Word Count")
    +  .config("spark.some.config.option", "some-value")
    +  .getOrCreate();
    +{% endhighlight %}
     </div>
     
     <div data-lang="python"  markdown="1">
     
    -The entry point into all relational functionality in Spark is the
    -[`SQLContext`](api/python/pyspark.sql.html#pyspark.sql.SQLContext) class, or one
    -of its decedents. To create a basic `SQLContext`, all you need is a SparkContext.
    +The entry point into all relational functionality in Spark is the [`SparkSession`](api/python/pyspark.sql.html#pyspark.sql.SparkSession) class. To create a basic `SparkSession`, just use `SparkSession.build`:
     
     {% highlight python %}
    -from pyspark.sql import SQLContext
    -sqlContext = SQLContext(sc)
    +from pyspark.sql import SparkSession
    +spark = SparkSession.build \
    +  .master("local") \
    +  .appName("Word Count") \
    +  .config("spark.some.config.option", "some-value") \
    +  .getOrCreate()
     {% endhighlight %}
     
     </div>
     
     <div data-lang="r"  markdown="1">
     
     The entry point into all relational functionality in Spark is the
    -`SQLContext` class, or one of its decedents. To create a basic `SQLContext`, all you need is a SparkContext.
    +`SparkSession` class. To create a basic `SparkSession`, all you need is a `SparkContext`.
     
     {% highlight r %}
    -sqlContext <- sparkRSQL.init(sc)
    +spark <- sparkRSQL.init(sc)
     {% endhighlight %}
     
     </div>
     </div>
     
    -In addition to the basic `SQLContext`, you can also create a `HiveContext`, which provides a
    -superset of the functionality provided by the basic `SQLContext`. Additional features include
    -the ability to write queries using the more complete HiveQL parser, access to Hive UDFs, and the
    -ability to read data from Hive tables. To use a `HiveContext`, you do not need to have an
    -existing Hive setup, and all of the data sources available to a `SQLContext` are still available.
    -`HiveContext` is only packaged separately to avoid including all of Hive's dependencies in the default
    -Spark build. If these dependencies are not a problem for your application then using `HiveContext`
    -is recommended for the 1.3 release of Spark. Future releases will focus on bringing `SQLContext` up
    -to feature parity with a `HiveContext`.
    +`SparkSession` in Spark 2.0 provides builtin support for Hive features including the ability to
    +write queries using HiveQL, access to Hive UDFs, and the ability to read data from Hive tables.
    +To use these features, you do not need to have an existing Hive setup, and all of the data sources
    +available to a `SparkSession` are still available. These Hive features are only packaged separately
    +to avoid including all of Hive's dependencies in the default Spark build.
     
     
     ## Creating DataFrames
     
    -With a `SQLContext`, applications can create `DataFrame`s from an <a href='#interoperating-with-rdds'>existing `RDD`</a>, from a Hive table, or from <a href='#data-sources'>data sources</a>.
    +With a `SparkSession`, applications can create DataFrames from an <a href='#interoperating-with-rdds'>existing `RDD`</a>,
    +from a Hive table, or from <a href='#data-sources'>data sources</a>.
     
    -As an example, the following creates a `DataFrame` based on the content of a JSON file:
    +As an example, the following creates a DataFrame based on the content of a JSON file:
    --- End diff --
    
    based on the content of a JSON file => based on the JSON data source API?
    That way, this example is an elaboration of
    
    ```
    With a `SparkSession`, applications can create DataFrames from an existing `RDD`,  from a Hive table, or from data sources
    ```




[GitHub] spark pull request #13592: [SPARK-15863][SQL][DOC] Initial SQL programming g...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13592#discussion_r66873712
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -12,130 +12,130 @@ title: Spark SQL and DataFrames
     Spark SQL is a Spark module for structured data processing. Unlike the basic Spark RDD API, the interfaces provided
     by Spark SQL provide Spark with more information about the structure of both the data and the computation being performed. Internally,
     Spark SQL uses this extra information to perform extra optimizations. There are several ways to
    -interact with Spark SQL including SQL, the DataFrames API and the Datasets API. When computing a result
    +interact with Spark SQL including SQL and the Datasets API. When computing a result
     the same execution engine is used, independent of which API/language you are using to express the
    -computation. This unification means that developers can easily switch back and forth between the
    -various APIs based on which provides the most natural way to express a given transformation.
    +computation. This unification means that developers can easily switch back and forth between
    +different APIs based on which provides the most natural way to express a given transformation.
     
     All of the examples on this page use sample data included in the Spark distribution and can be run in
     the `spark-shell`, `pyspark` shell, or `sparkR` shell.
     
     ## SQL
     
    -One use of Spark SQL is to execute SQL queries written using either a basic SQL syntax or HiveQL.
    --- End diff --
    
    why change this line?




[GitHub] spark issue #13592: [SPARK-15863][SQL][DOC] Initial SQL programming guide up...

Posted by maropu <gi...@git.apache.org>.
Github user maropu commented on the issue:

    https://github.com/apache/spark/pull/13592
  
    @liancheng okay, I'll do that.




[GitHub] spark pull request #13592: [SPARK-15863][SQL][DOC] Initial SQL programming g...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13592#discussion_r66663638
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -184,20 +175,20 @@ showDF(df)
     </div>
     
     
    -## DataFrame Operations
    +## Untyped Dataset Operations
     
    -DataFrames provide a domain-specific language for structured data manipulation in [Scala](api/scala/index.html#org.apache.spark.sql.DataFrame), [Java](api/java/index.html?org/apache/spark/sql/DataFrame.html), [Python](api/python/pyspark.sql.html#pyspark.sql.DataFrame) and [R](api/R/DataFrame.html).
    +Datasets provide an untyped domain-specific language for structured data manipulation in [Scala](api/scala/index.html#org.apache.spark.sql.Dataset), [Java](api/java/index.html?org/apache/spark/sql/Dataset.html), [Python](api/python/pyspark.sql.html#pyspark.sql.DataFrame) and [R](api/R/DataFrame.html).
    --- End diff --
    
    Updated. The current strategy is to use "Dataset" only when it's clearly in the context of the Java and/or Scala API.




[GitHub] spark pull request #13592: [SPARK-15863][SQL][DOC] Initial SQL programming g...

Posted by clockfly <gi...@git.apache.org>.
Github user clockfly commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13592#discussion_r66893918
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -12,130 +12,130 @@ title: Spark SQL and DataFrames
     Spark SQL is a Spark module for structured data processing. Unlike the basic Spark RDD API, the interfaces provided
     by Spark SQL provide Spark with more information about the structure of both the data and the computation being performed. Internally,
     Spark SQL uses this extra information to perform extra optimizations. There are several ways to
    -interact with Spark SQL including SQL, the DataFrames API and the Datasets API. When computing a result
    +interact with Spark SQL including SQL and the Datasets API. When computing a result
     the same execution engine is used, independent of which API/language you are using to express the
    -computation. This unification means that developers can easily switch back and forth between the
    -various APIs based on which provides the most natural way to express a given transformation.
    +computation. This unification means that developers can easily switch back and forth between
    +different APIs based on which provides the most natural way to express a given transformation.
     
     All of the examples on this page use sample data included in the Spark distribution and can be run in
     the `spark-shell`, `pyspark` shell, or `sparkR` shell.
     
     ## SQL
     
    -One use of Spark SQL is to execute SQL queries written using either a basic SQL syntax or HiveQL.
    +One use of Spark SQL is to execute SQL queries.
     Spark SQL can also be used to read data from an existing Hive installation. For more on how to
     configure this feature, please refer to the [Hive Tables](#hive-tables) section. When running
    -SQL from within another programming language the results will be returned as a [DataFrame](#DataFrames).
    +SQL from within another programming language the results will be returned as a [Dataset\[Row\]](#datasets).
     You can also interact with the SQL interface using the [command-line](#running-the-spark-sql-cli)
     or over [JDBC/ODBC](#running-the-thrift-jdbcodbc-server).
     
    -## DataFrames
    +## Datasets and DataFrames
     
    -A DataFrame is a distributed collection of data organized into named columns. It is conceptually
    -equivalent to a table in a relational database or a data frame in R/Python, but with richer
    -optimizations under the hood. DataFrames can be constructed from a wide array of [sources](#data-sources) such
    -as: structured data files, tables in Hive, external databases, or existing RDDs.
    +A Dataset is a new interface added in Spark 1.6 that tries to provide the benefits of RDDs (strong
    +typing, ability to use powerful lambda functions) with the benefits of Spark SQL's optimized
    +execution engine. A Dataset can be [constructed](#creating-datasets) from JVM objects and then
    +manipulated using functional transformations (map, flatMap, filter, etc.).
     
    -The DataFrame API is available in [Scala](api/scala/index.html#org.apache.spark.sql.DataFrame),
    -[Java](api/java/index.html?org/apache/spark/sql/DataFrame.html),
    -[Python](api/python/pyspark.sql.html#pyspark.sql.DataFrame), and [R](api/R/index.html).
    +The Dataset API is the successor of the DataFrame API, which was introduced in Spark 1.3. In Spark
    +2.0, Datasets and DataFrames are unified, and DataFrames are now equivalent to Datasets of `Row`s.
    +In fact, `DataFrame` is simply a type alias of `Dataset[Row]` in [the Scala API][scala-datasets].
    +However, [Java API][java-datasets] users must use `Dataset<Row>` instead.
     
    -## Datasets
    +[scala-datasets]: api/scala/index.html#org.apache.spark.sql.Dataset
    +[java-datasets]: api/java/index.html?org/apache/spark/sql/Dataset.html
     
    -A Dataset is a new experimental interface added in Spark 1.6 that tries to provide the benefits of
    -RDDs (strong typing, ability to use powerful lambda functions) with the benefits of Spark SQL's
    -optimized execution engine. A Dataset can be [constructed](#creating-datasets) from JVM objects and then manipulated
    -using functional transformations (map, flatMap, filter, etc.).
    +Python does not have support for the Dataset API, but due to its dynamic nature many of the
    +benefits are already available (i.e. you can access the field of a row by name naturally
    +`row.columnName`). The case for R is similar.
     
    -The unified Dataset API can be used both in [Scala](api/scala/index.html#org.apache.spark.sql.Dataset) and
    -[Java](api/java/index.html?org/apache/spark/sql/Dataset.html). Python does not yet have support for
    -the Dataset API, but due to its dynamic nature many of the benefits are already available (i.e. you can
    -access the field of a row by name naturally `row.columnName`). Full python support will be added
    -in a future release.
    +Throughout this document, we will often refer to Scala/Java Datasets of `Row`s as DataFrames.
     
     # Getting Started
     
    -## Starting Point: SQLContext
    +## Starting Point: SparkSession
     
     <div class="codetabs">
     <div data-lang="scala"  markdown="1">
     
    -The entry point into all functionality in Spark SQL is the
    -[`SQLContext`](api/scala/index.html#org.apache.spark.sql.SQLContext) class, or one of its
    -descendants. To create a basic `SQLContext`, all you need is a SparkContext.
    +The entry point into all functionality in Spark SQL is the [`SparkSession`](api/scala/index.html#org.apache.spark.sql.SparkSession) class. To create a basic `SparkSession`, just use `SparkSession.builder()`:
     
     {% highlight scala %}
    -val sc: SparkContext // An existing SparkContext.
    -val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    +import org.apache.spark.sql.SparkSession
    +
    +val spark = SparkSession.builder()
    +  .master("local")
    +  .appName("Word Count")
    +  .config("spark.some.config.option", "some-value")
    +  .getOrCreate()
     
     // this is used to implicitly convert an RDD to a DataFrame.
    -import sqlContext.implicits._
    +import spark.implicits._
     {% endhighlight %}
     
     </div>
     
     <div data-lang="java" markdown="1">
     
    -The entry point into all functionality in Spark SQL is the
    -[`SQLContext`](api/java/index.html#org.apache.spark.sql.SQLContext) class, or one of its
    -descendants. To create a basic `SQLContext`, all you need is a SparkContext.
    +The entry point into all functionality in Spark SQL is the [`SparkSession`](api/java/index.html#org.apache.spark.sql.SparkSession) class. To create a basic `SparkSession`, just use `SparkSession.builder()`:
     
     {% highlight java %}
    -JavaSparkContext sc = ...; // An existing JavaSparkContext.
    -SQLContext sqlContext = new org.apache.spark.sql.SQLContext(sc);
    -{% endhighlight %}
    +import org.apache.spark.sql.SparkSession;
     
    +SparkSession spark = SparkSession.builder()
    +  .master("local")
    +  .appName("Word Count")
    +  .config("spark.some.config.option", "some-value")
    +  .getOrCreate();
    +{% endhighlight %}
     </div>
     
     <div data-lang="python"  markdown="1">
     
    -The entry point into all relational functionality in Spark is the
    -[`SQLContext`](api/python/pyspark.sql.html#pyspark.sql.SQLContext) class, or one
    -of its decedents. To create a basic `SQLContext`, all you need is a SparkContext.
    +The entry point into all relational functionality in Spark is the [`SparkSession`](api/python/pyspark.sql.html#pyspark.sql.SparkSession) class. To create a basic `SparkSession`, just use `SparkSession.builder`:
     
     {% highlight python %}
    -from pyspark.sql import SQLContext
    -sqlContext = SQLContext(sc)
    +from pyspark.sql import SparkSession
    +spark = SparkSession.builder \
    +  .master("local") \
    +  .appName("Word Count") \
    +  .config("spark.some.config.option", "some-value") \
    +  .getOrCreate()
     {% endhighlight %}
     
     </div>
     
     <div data-lang="r"  markdown="1">
     
     The entry point into all relational functionality in Spark is the
    -`SQLContext` class, or one of its decedents. To create a basic `SQLContext`, all you need is a SparkContext.
    +`SparkSession` class. To create a basic `SparkSession`, all you need is a `SparkContext`.
     
     {% highlight r %}
    -sqlContext <- sparkRSQL.init(sc)
    +spark <- sparkRSQL.init(sc)
     {% endhighlight %}
     
     </div>
     </div>
     
    -In addition to the basic `SQLContext`, you can also create a `HiveContext`, which provides a
    -superset of the functionality provided by the basic `SQLContext`. Additional features include
    -the ability to write queries using the more complete HiveQL parser, access to Hive UDFs, and the
    -ability to read data from Hive tables. To use a `HiveContext`, you do not need to have an
    -existing Hive setup, and all of the data sources available to a `SQLContext` are still available.
    -`HiveContext` is only packaged separately to avoid including all of Hive's dependencies in the default
    -Spark build. If these dependencies are not a problem for your application then using `HiveContext`
    -is recommended for the 1.3 release of Spark. Future releases will focus on bringing `SQLContext` up
    -to feature parity with a `HiveContext`.
    +`SparkSession` in Spark 2.0 provides builtin support for Hive features including the ability to
    +write queries using HiveQL, access to Hive UDFs, and the ability to read data from Hive tables.
    +To use these features, you do not need to have an existing Hive setup, and all of the data sources
    +available to a `SparkSession` are still available. These Hive features are only packaged separately
    +to avoid including all of Hive's dependencies in the default Spark build.
    --- End diff --
    
    I think this paragraph (line 119 to line 123) is not right.
    
    ```available to a `SparkSession` are still available``` still available to WHAT??
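
    For illustration, a minimal Scala sketch of what that paragraph seems to intend (assuming Spark 2.0's `SparkSession.builder()` entry point; the app name and JSON path are illustrative):

    ```scala
    import org.apache.spark.sql.SparkSession

    // Hive support is opt-in on the same builder; no separate HiveContext is needed.
    val spark = SparkSession.builder()
      .appName("Hive-enabled session")
      .enableHiveSupport() // HiveQL queries, Hive UDFs, access to Hive tables
      .getOrCreate()

    // The regular data sources remain available to this Hive-enabled session as well.
    val people = spark.read.json("examples/src/main/resources/people.json")
    ```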




[GitHub] spark pull request #13592: [SPARK-15863][SQL][DOC] Initial SQL programming g...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13592#discussion_r66877689
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -604,49 +607,47 @@ JavaRDD<Person> people = sc.textFile("examples/src/main/resources/people.txt").m
       });
     
     // Apply a schema to an RDD of JavaBeans and register it as a table.
    -DataFrame schemaPeople = sqlContext.createDataFrame(people, Person.class);
    +Dataset<Row> schemaPeople = spark.createDataFrame(people, Person.class);
     schemaPeople.createOrReplaceTempView("people");
     
     // SQL can be run over RDDs that have been registered as tables.
    -DataFrame teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
    +Dataset<Row> teenagers = spark.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
     
    -// The results of SQL queries are DataFrames and support all the normal RDD operations.
     // The columns of a row in the result can be accessed by ordinal.
    -List<String> teenagerNames = teenagers.javaRDD().map(new Function<Row, String>() {
    +List<String> teenagerNames = teenagers.map(new MapFunction<Row, String>() {
       public String call(Row row) {
         return "Name: " + row.getString(0);
       }
    -}).collect();
    +}, Encoders.STRING()).collectAsList();
     
     {% endhighlight %}
     
     </div>
     
     <div data-lang="python"  markdown="1">
     
    +<!-- TODO: Is the following still true in Spark 2.0? -->
    --- End diff --
    
    looks like it's still valid in python
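
    One note on the Java hunk above: in 2.0, the Java `Dataset.map` takes an explicit `Encoder` for the result type (e.g. `Encoders.STRING()`). The Scala side picks the encoder up implicitly; a minimal sketch, assuming an existing `SparkSession` named `spark`:

    ```scala
    import spark.implicits._ // supplies the Encoder[String] used by map below

    val teenagers = spark.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
    // Columns of a Row are accessed by ordinal; the result is a Dataset[String].
    val teenagerNames = teenagers.map(row => "Name: " + row.getString(0)).collect()
    ```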




[GitHub] spark issue #13592: [SPARK-15863][SQL][DOC] Initial SQL programming guide up...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13592
  
    **[Test build #60698 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60698/consoleFull)** for PR 13592 at commit [`f413cbb`](https://github.com/apache/spark/commit/f413cbb643c90e004d0f001e7c659373420167c6).




[GitHub] spark issue #13592: [SPARK-15863][SQL][DOC] Initial SQL programming guide up...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on the issue:

    https://github.com/apache/spark/pull/13592
  
    Thanks everyone for the review!




[GitHub] spark pull request #13592: [SPARK-15863][SQL][DOC] Initial SQL programming g...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13592#discussion_r67749954
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -419,35 +423,39 @@ In addition to simple column references and expressions, DataFrames also have a
     
     ## Running SQL Queries Programmatically
     
    -The `sql` function on a `SQLContext` enables applications to run SQL queries programmatically and returns the result as a `DataFrame`.
    -
     <div class="codetabs">
     <div data-lang="scala"  markdown="1">
    +The `sql` function on a `SparkSession` enables applications to run SQL queries programmatically and returns the result as a `DataFrame`.
    +
     {% highlight scala %}
    -val sqlContext = ... // An existing SQLContext
    -val df = sqlContext.sql("SELECT * FROM table")
    +val spark = ... // An existing SparkSession
    +val df = spark.sql("SELECT * FROM table")
     {% endhighlight %}
     </div>
     
     <div data-lang="java" markdown="1">
    +The `sql` function on a `SparkSession` enables applications to run SQL queries programmatically and returns the result as a `Dataset<Row>`.
    +
     {% highlight java %}
    -SQLContext sqlContext = ... // An existing SQLContext
    -DataFrame df = sqlContext.sql("SELECT * FROM table")
    +SparkSession spark = ... // An existing SparkSession
    +Dataset<Row> df = spark.sql("SELECT * FROM table")
     {% endhighlight %}
     </div>
     
     <div data-lang="python"  markdown="1">
    +The `sql` function on a `SparkSession` enables applications to run SQL queries programmatically and returns the result as a `DataFrame`.
    +
     {% highlight python %}
    -from pyspark.sql import SQLContext
    -sqlContext = SQLContext(sc)
    -df = sqlContext.sql("SELECT * FROM table")
    +# spark is an existing SparkSession
    +df = spark.sql("SELECT * FROM table")
     {% endhighlight %}
     </div>
     
     <div data-lang="r"  markdown="1">
    +The `sql` function enables applications to run SQL queries programmatically and returns the result as a `DataFrame`.
    --- End diff --
    
    also, in SparkR, the class is called "SparkDataFrame"




[GitHub] spark issue #13592: [SPARK-15863][SQL][DOC] Initial SQL programming guide up...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13592
  
    **[Test build #60309 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60309/consoleFull)** for PR 13592 at commit [`819e109`](https://github.com/apache/spark/commit/819e109c5aa8a33fd2e589d09f3dbbbed13162e0).




[GitHub] spark issue #13592: [SPARK-15863][SQL][DOC] Initial SQL programming guide up...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13592
  
    **[Test build #60664 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60664/consoleFull)** for PR 13592 at commit [`4b3c4d3`](https://github.com/apache/spark/commit/4b3c4d31644745d3d74f265644b2d9d406cfe82b).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #13592: [SPARK-15863][SQL][DOC] Initial SQL programming guide up...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13592
  
    **[Test build #60698 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60698/consoleFull)** for PR 13592 at commit [`f413cbb`](https://github.com/apache/spark/commit/f413cbb643c90e004d0f001e7c659373420167c6).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request #13592: [SPARK-15863][SQL][DOC] Initial SQL programming g...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13592#discussion_r66884198
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -12,130 +12,130 @@ title: Spark SQL and DataFrames
     Spark SQL is a Spark module for structured data processing. Unlike the basic Spark RDD API, the interfaces provided
     by Spark SQL provide Spark with more information about the structure of both the data and the computation being performed. Internally,
     Spark SQL uses this extra information to perform extra optimizations. There are several ways to
    -interact with Spark SQL including SQL, the DataFrames API and the Datasets API. When computing a result
    +interact with Spark SQL including SQL and the Datasets API. When computing a result
     the same execution engine is used, independent of which API/language you are using to express the
    -computation. This unification means that developers can easily switch back and forth between the
    -various APIs based on which provides the most natural way to express a given transformation.
    +computation. This unification means that developers can easily switch back and forth between
    +different APIs based on which provides the most natural way to express a given transformation.
     
     All of the examples on this page use sample data included in the Spark distribution and can be run in
     the `spark-shell`, `pyspark` shell, or `sparkR` shell.
     
     ## SQL
     
    -One use of Spark SQL is to execute SQL queries written using either a basic SQL syntax or HiveQL.
    --- End diff --
    
    The "basic SQL" used to refer to the simple SQL dialect we supported in 1.x but removed in 2.0.




[GitHub] spark pull request #13592: [SPARK-15863][SQL][DOC] Initial SQL programming g...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13592#discussion_r67750092
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -170,34 +175,37 @@ df.show()
     </div>
     
     <div data-lang="r"  markdown="1">
    -{% highlight r %}
    -sqlContext <- SQLContext(sc)
    +With a `SQLContext`, applications can create DataFrames from an [existing `RDD`](#interoperating-with-rdds),
    --- End diff --
    
    ... "SparkSession"




[GitHub] spark pull request #13592: [SPARK-15863][SQL][DOC] Initial SQL programming g...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13592#discussion_r67763740
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -12,130 +12,129 @@ title: Spark SQL and DataFrames
     Spark SQL is a Spark module for structured data processing. Unlike the basic Spark RDD API, the interfaces provided
     by Spark SQL provide Spark with more information about the structure of both the data and the computation being performed. Internally,
     Spark SQL uses this extra information to perform extra optimizations. There are several ways to
    -interact with Spark SQL including SQL, the DataFrames API and the Datasets API. When computing a result
    +interact with Spark SQL including SQL and the Dataset API. When computing a result
    --- End diff --
    
    Seems we should still mention DataFrame here?




[GitHub] spark pull request #13592: [SPARK-15863][SQL][DOC] Initial SQL programming g...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13592#discussion_r66884076
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -517,24 +517,26 @@ types such as Sequences or Arrays. This RDD can be implicitly converted to a Dat
     registered as a table. Tables can be used in subsequent SQL statements.
     
     {% highlight scala %}
    -// sc is an existing SparkContext.
    -val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    +val spark: SparkSession // An existing SparkSession
     // this is used to implicitly convert an RDD to a DataFrame.
    -import sqlContext.implicits._
    +import spark.implicits._
     
     // Define the schema using a case class.
     // Note: Case classes in Scala 2.10 can support only up to 22 fields. To work around this limit,
     // you can use custom classes that implement the Product interface.
     case class Person(name: String, age: Int)
     
    -// Create an RDD of Person objects and register it as a table.
    -val people = sc.textFile("examples/src/main/resources/people.txt").map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt)).toDF()
    +// Create an RDD of Person objects and register it as a temporary view.
    +val people = sc
    +  .textFile("examples/src/main/resources/people.txt")
    +  .map(_.split(","))
    +  .map(p => Person(p(0), p(1).trim.toInt))
    +  .toDF()
     people.createOrReplaceTempView("people")
    --- End diff --
    
    As mentioned in the body text above this sample, this snippet is intentionally demonstrating how to convert an existing RDD to a DataFrame.
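
    For contrast, a sketch of the non-RDD route (not what this example is meant to demonstrate), assuming 2.0's `spark.read.textFile` and the same `Person` case class and `spark.implicits._` import as above:

    ```scala
    // Read the file directly as a Dataset[String], skipping the explicit RDD step.
    val peopleDF = spark.read.textFile("examples/src/main/resources/people.txt")
      .map { line =>
        val fields = line.split(",")
        Person(fields(0), fields(1).trim.toInt)
      }
      .toDF()
    peopleDF.createOrReplaceTempView("people")
    ```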




[GitHub] spark issue #13592: [SPARK-15863][SQL][DOC] Initial SQL programming guide up...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13592
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request #13592: [SPARK-15863][SQL][DOC] Initial SQL programming g...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13592#discussion_r67774733
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -12,130 +12,129 @@ title: Spark SQL and DataFrames
     Spark SQL is a Spark module for structured data processing. Unlike the basic Spark RDD API, the interfaces provided
     by Spark SQL provide Spark with more information about the structure of both the data and the computation being performed. Internally,
     Spark SQL uses this extra information to perform extra optimizations. There are several ways to
    -interact with Spark SQL including SQL, the DataFrames API and the Datasets API. When computing a result
    +interact with Spark SQL including SQL and the Dataset API. When computing a result
     the same execution engine is used, independent of which API/language you are using to express the
    -computation. This unification means that developers can easily switch back and forth between the
    -various APIs based on which provides the most natural way to express a given transformation.
    +computation. This unification means that developers can easily switch back and forth between
    +different APIs based on which provides the most natural way to express a given transformation.
     
     All of the examples on this page use sample data included in the Spark distribution and can be run in
     the `spark-shell`, `pyspark` shell, or `sparkR` shell.
     
     ## SQL
     
    -One use of Spark SQL is to execute SQL queries written using either a basic SQL syntax or HiveQL.
    +One use of Spark SQL is to execute SQL queries.
     Spark SQL can also be used to read data from an existing Hive installation. For more on how to
     configure this feature, please refer to the [Hive Tables](#hive-tables) section. When running
    -SQL from within another programming language the results will be returned as a [DataFrame](#DataFrames).
    +SQL from within another programming language the results will be returned as a [DataFrame](#datasets-and-dataframes).
     You can also interact with the SQL interface using the [command-line](#running-the-spark-sql-cli)
     or over [JDBC/ODBC](#running-the-thrift-jdbcodbc-server).
     
    -## DataFrames
    +## Datasets and DataFrames
     
    -A DataFrame is a distributed collection of data organized into named columns. It is conceptually
    -equivalent to a table in a relational database or a data frame in R/Python, but with richer
    -optimizations under the hood. DataFrames can be constructed from a wide array of [sources](#data-sources) such
    -as: structured data files, tables in Hive, external databases, or existing RDDs.
    +A Dataset is a new interface added in Spark 1.6 that tries to provide the benefits of RDDs (strong
    +typing, ability to use powerful lambda functions) with the benefits of Spark SQL's optimized
    +execution engine. A Dataset can be [constructed](#creating-datasets) from JVM objects and then
    +manipulated using functional transformations (`map`, `flatMap`, `filter`, etc.).
     
    -The DataFrame API is available in [Scala](api/scala/index.html#org.apache.spark.sql.DataFrame),
    -[Java](api/java/index.html?org/apache/spark/sql/DataFrame.html),
    -[Python](api/python/pyspark.sql.html#pyspark.sql.DataFrame), and [R](api/R/index.html).
    +The Dataset API is the successor of the DataFrame API, which was introduced in Spark 1.3. In Spark
    +2.0, Datasets and DataFrames are unified, and DataFrames are now equivalent to Datasets of `Row`s.
    +In fact, `DataFrame` is simply a type alias of `Dataset[Row]` in [the Scala API][scala-datasets].
    +However, [Java API][java-datasets] users must use `Dataset<Row>` instead.
     
    -## Datasets
    +[scala-datasets]: api/scala/index.html#org.apache.spark.sql.Dataset
    +[java-datasets]: api/java/index.html?org/apache/spark/sql/Dataset.html
     
    -A Dataset is a new experimental interface added in Spark 1.6 that tries to provide the benefits of
    -RDDs (strong typing, ability to use powerful lambda functions) with the benefits of Spark SQL's
    -optimized execution engine. A Dataset can be [constructed](#creating-datasets) from JVM objects and then manipulated
    -using functional transformations (map, flatMap, filter, etc.).
    +Python does not have support for the Dataset API, but due to its dynamic nature many of the
    +benefits are already available (i.e. you can access the field of a row by name naturally
    +`row.columnName`). The case for R is similar.
     
    -The unified Dataset API can be used both in [Scala](api/scala/index.html#org.apache.spark.sql.Dataset) and
    -[Java](api/java/index.html?org/apache/spark/sql/Dataset.html). Python does not yet have support for
    -the Dataset API, but due to its dynamic nature many of the benefits are already available (i.e. you can
    -access the field of a row by name naturally `row.columnName`). Full python support will be added
    -in a future release.
    +Throughout this document, we will often refer to Scala/Java Datasets of `Row`s as DataFrames.
     
     # Getting Started
     
    -## Starting Point: SQLContext
    +## Starting Point: SparkSession
    --- End diff --
    
    ![image](https://cloud.githubusercontent.com/assets/2072857/16211603/607afcb2-36f7-11e6-9e8f-f206f30ea018.png)
    
    The doc looks like this. I am not sure if there is a better way to improve this section (making it clear that SparkSession is not available in SparkR). @felixcheung @shivaram maybe you have better ideas?




[GitHub] spark issue #13592: [SPARK-15863][SQL][DOC] Initial SQL programming guide up...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13592
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request #13592: [SPARK-15863][SQL][DOC] Initial SQL programming g...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13592#discussion_r67202611
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -419,35 +419,35 @@ In addition to simple column references and expressions, DataFrames also have a
     
     ## Running SQL Queries Programmatically
     
    -The `sql` function on a `SQLContext` enables applications to run SQL queries programmatically and returns the result as a `DataFrame`.
    +The `sql` function on a `SparkSession` enables applications to run SQL queries programmatically and returns the result as a `DataFrame`.
     
     <div class="codetabs">
     <div data-lang="scala"  markdown="1">
     {% highlight scala %}
    -val sqlContext = ... // An existing SQLContext
    -val df = sqlContext.sql("SELECT * FROM table")
    +val spark = ... // An existing SparkSession
    +val df = spark.sql("SELECT * FROM table")
     {% endhighlight %}
     </div>
     
     <div data-lang="java" markdown="1">
     {% highlight java %}
    -SQLContext sqlContext = ... // An existing SQLContext
    -DataFrame df = sqlContext.sql("SELECT * FROM table")
    +SparkSession spark = ... // An existing SparkSession
    +Dataset<Row> df = spark.sql("SELECT * FROM table")
     {% endhighlight %}
     </div>
     
     <div data-lang="python"  markdown="1">
     {% highlight python %}
    -from pyspark.sql import SQLContext
    -sqlContext = SQLContext(sc)
    -df = sqlContext.sql("SELECT * FROM table")
    +from pyspark.sql import SparkSession
    +spark = SparkSession(sc)
    +df = spark.sql("SELECT * FROM table")
     {% endhighlight %}
     </div>
     
     <div data-lang="r"  markdown="1">
     {% highlight r %}
    -sqlContext <- sparkRSQL.init(sc)
    -df <- sql(sqlContext, "SELECT * FROM table")
    +spark <- sparkRSQL.init(sc)
    +df <- sql(spark, "SELECT * FROM table")
    --- End diff --
    
    Here, too. Remove `spark <- sparkRSQL.init(sc)` and use
    ```
    df <- sql("SELECT * FROM table")
    ```




[GitHub] spark pull request #13592: [SPARK-15863][SQL][DOC] Initial SQL programming g...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13592#discussion_r66877316
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -587,7 +590,7 @@ for the JavaBean.
     
     {% highlight java %}
     // sc is an existing JavaSparkContext.
    -SQLContext sqlContext = new org.apache.spark.sql.SQLContext(sc);
    +SparkSession spark = new org.apache.spark.sql.SparkSession(sc);
     
     // Load a text file and convert each line to a JavaBean.
     JavaRDD<Person> people = sc.textFile("examples/src/main/resources/people.txt").map(
    --- End diff --
    
    is this example still valid?
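
    For reference, a sketch of the builder-based way to obtain a session in 2.0 (the constructor is not the documented entry point; the app name below is illustrative):

    ```scala
    import org.apache.spark.sql.SparkSession

    // getOrCreate() reuses an existing session if one is already running.
    val spark = SparkSession.builder()
      .appName("JavaBeans example")
      .getOrCreate()

    // The underlying SparkContext is still available for RDD code such as textFile.
    val sc = spark.sparkContext
    ```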




[GitHub] spark pull request #13592: [SPARK-15863][SQL][DOC] Initial SQL programming g...

Posted by clockfly <gi...@git.apache.org>.
Github user clockfly commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13592#discussion_r66893224
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -12,130 +12,130 @@ title: Spark SQL and DataFrames
     Spark SQL is a Spark module for structured data processing. Unlike the basic Spark RDD API, the interfaces provided
     by Spark SQL provide Spark with more information about the structure of both the data and the computation being performed. Internally,
     Spark SQL uses this extra information to perform extra optimizations. There are several ways to
    -interact with Spark SQL including SQL, the DataFrames API and the Datasets API. When computing a result
    +interact with Spark SQL including SQL and the Datasets API. When computing a result
     the same execution engine is used, independent of which API/language you are using to express the
    -computation. This unification means that developers can easily switch back and forth between the
    -various APIs based on which provides the most natural way to express a given transformation.
    +computation. This unification means that developers can easily switch back and forth between
    +different APIs based on which provides the most natural way to express a given transformation.
     
     All of the examples on this page use sample data included in the Spark distribution and can be run in
     the `spark-shell`, `pyspark` shell, or `sparkR` shell.
     
     ## SQL
     
    -One use of Spark SQL is to execute SQL queries written using either a basic SQL syntax or HiveQL.
    +One use of Spark SQL is to execute SQL queries.
     Spark SQL can also be used to read data from an existing Hive installation. For more on how to
     configure this feature, please refer to the [Hive Tables](#hive-tables) section. When running
    -SQL from within another programming language the results will be returned as a [DataFrame](#DataFrames).
    +SQL from within another programming language the results will be returned as a [Dataset\[Row\]](#datasets).
     You can also interact with the SQL interface using the [command-line](#running-the-spark-sql-cli)
     or over [JDBC/ODBC](#running-the-thrift-jdbcodbc-server).
     
    -## DataFrames
    +## Datasets and DataFrames
     
    -A DataFrame is a distributed collection of data organized into named columns. It is conceptually
    -equivalent to a table in a relational database or a data frame in R/Python, but with richer
    -optimizations under the hood. DataFrames can be constructed from a wide array of [sources](#data-sources) such
    -as: structured data files, tables in Hive, external databases, or existing RDDs.
    +A Dataset is a new interface added in Spark 1.6 that tries to provide the benefits of RDDs (strong
    +typing, ability to use powerful lambda functions) with the benefits of Spark SQL's optimized
    +execution engine. A Dataset can be [constructed](#creating-datasets) from JVM objects and then
    +manipulated using functional transformations (map, flatMap, filter, etc.).
     
    -The DataFrame API is available in [Scala](api/scala/index.html#org.apache.spark.sql.DataFrame),
    -[Java](api/java/index.html?org/apache/spark/sql/DataFrame.html),
    -[Python](api/python/pyspark.sql.html#pyspark.sql.DataFrame), and [R](api/R/index.html).
    +The Dataset API is the successor of the DataFrame API, which was introduced in Spark 1.3. In Spark
    --- End diff --
    
    I will remove this line ~~The Dataset API is the successor of the DataFrame API, which was introduced in Spark 1.3.~~ 




[GitHub] spark pull request #13592: [SPARK-15863][SQL][DOC] Initial SQL programming g...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13592#discussion_r67203431
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -1142,11 +1141,11 @@ write.parquet(schemaPeople, "people.parquet")
     
     # Read in the Parquet file created above. Parquet files are self-describing so the schema is preserved.
     # The result of loading a parquet file is also a DataFrame.
    -parquetFile <- read.parquet(sqlContext, "people.parquet")
    +parquetFile <- read.parquet(spark, "people.parquet")
    --- End diff --
    
    ```
    parquetFile <- read.parquet("people.parquet")
    ```
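
    The same read in Scala, for comparison (a sketch; `spark` is an existing `SparkSession` and the path matches the example above):

    ```scala
    // Parquet files are self-describing, so the schema is preserved on read.
    val parquetFileDF = spark.read.parquet("people.parquet")
    parquetFileDF.printSchema()
    ```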




[GitHub] spark pull request #13592: [SPARK-15863][SQL][DOC] Initial SQL programming g...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13592#discussion_r67746855
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -12,130 +12,129 @@ title: Spark SQL and DataFrames
     Spark SQL is a Spark module for structured data processing. Unlike the basic Spark RDD API, the interfaces provided
     by Spark SQL provide Spark with more information about the structure of both the data and the computation being performed. Internally,
     Spark SQL uses this extra information to perform extra optimizations. There are several ways to
    -interact with Spark SQL including SQL, the DataFrames API and the Datasets API. When computing a result
    +interact with Spark SQL including SQL and the Dataset API. When computing a result
     the same execution engine is used, independent of which API/language you are using to express the
    -computation. This unification means that developers can easily switch back and forth between the
    -various APIs based on which provides the most natural way to express a given transformation.
    +computation. This unification means that developers can easily switch back and forth between
    +different APIs based on which provides the most natural way to express a given transformation.
     
     All of the examples on this page use sample data included in the Spark distribution and can be run in
     the `spark-shell`, `pyspark` shell, or `sparkR` shell.
     
     ## SQL
     
    -One use of Spark SQL is to execute SQL queries written using either a basic SQL syntax or HiveQL.
    +One use of Spark SQL is to execute SQL queries.
     Spark SQL can also be used to read data from an existing Hive installation. For more on how to
     configure this feature, please refer to the [Hive Tables](#hive-tables) section. When running
    -SQL from within another programming language the results will be returned as a [DataFrame](#DataFrames).
    +SQL from within another programming language the results will be returned as a [DataFrame](#datasets-and-dataframes).
     You can also interact with the SQL interface using the [command-line](#running-the-spark-sql-cli)
     or over [JDBC/ODBC](#running-the-thrift-jdbcodbc-server).
     
    -## DataFrames
    +## Datasets and DataFrames
     
    -A DataFrame is a distributed collection of data organized into named columns. It is conceptually
    -equivalent to a table in a relational database or a data frame in R/Python, but with richer
    -optimizations under the hood. DataFrames can be constructed from a wide array of [sources](#data-sources) such
    -as: structured data files, tables in Hive, external databases, or existing RDDs.
    +A Dataset is a new interface added in Spark 1.6 that tries to provide the benefits of RDDs (strong
    +typing, ability to use powerful lambda functions) with the benefits of Spark SQL's optimized
    +execution engine. A Dataset can be [constructed](#creating-datasets) from JVM objects and then
    +manipulated using functional transformations (`map`, `flatMap`, `filter`, etc.).
     
    -The DataFrame API is available in [Scala](api/scala/index.html#org.apache.spark.sql.DataFrame),
    -[Java](api/java/index.html?org/apache/spark/sql/DataFrame.html),
    -[Python](api/python/pyspark.sql.html#pyspark.sql.DataFrame), and [R](api/R/index.html).
    +The Dataset API is the successor of the DataFrame API, which was introduced in Spark 1.3. In Spark
    +2.0, Datasets and DataFrames are unified, and DataFrames are now equivalent to Datasets of `Row`s.
    +In fact, `DataFrame` is simply a type alias of `Dataset[Row]` in [the Scala API][scala-datasets].
    +However, [Java API][java-datasets] users must use `Dataset<Row>` instead.
     
    -## Datasets
    +[scala-datasets]: api/scala/index.html#org.apache.spark.sql.Dataset
    +[java-datasets]: api/java/index.html?org/apache/spark/sql/Dataset.html
     
    -A Dataset is a new experimental interface added in Spark 1.6 that tries to provide the benefits of
    -RDDs (strong typing, ability to use powerful lambda functions) with the benefits of Spark SQL's
    -optimized execution engine. A Dataset can be [constructed](#creating-datasets) from JVM objects and then manipulated
    -using functional transformations (map, flatMap, filter, etc.).
    +Python does not have support for the Dataset API, but due to its dynamic nature many of the
    +benefits are already available (i.e. you can access the field of a row by name naturally
    +`row.columnName`). The case for R is similar.
     
    -The unified Dataset API can be used both in [Scala](api/scala/index.html#org.apache.spark.sql.Dataset) and
    -[Java](api/java/index.html?org/apache/spark/sql/Dataset.html). Python does not yet have support for
    -the Dataset API, but due to its dynamic nature many of the benefits are already available (i.e. you can
    -access the field of a row by name naturally `row.columnName`). Full python support will be added
    -in a future release.
    +Throughout this document, we will often refer to Scala/Java Datasets of `Row`s as DataFrames.
     
     # Getting Started
     
    -## Starting Point: SQLContext
    +## Starting Point: SparkSession
     
     <div class="codetabs">
     <div data-lang="scala"  markdown="1">
     
    -The entry point into all functionality in Spark SQL is the
    -[`SQLContext`](api/scala/index.html#org.apache.spark.sql.SQLContext) class, or one of its
    -descendants. To create a basic `SQLContext`, all you need is a SparkContext.
    +The entry point into all functionality in Spark is the [`SparkSession`](api/scala/index.html#org.apache.spark.sql.SparkSession) class. To create a basic `SparkSession`, just use `SparkSession.builder()`:
     
     {% highlight scala %}
    -val sc: SparkContext // An existing SparkContext.
    -val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    +import org.apache.spark.sql.SparkSession
    +
    +val spark = SparkSession.builder()
    +  .master("local")
    +  .appName("Word Count")
    +  .config("spark.some.config.option", "some-value")
    +  .getOrCreate()
     
     // this is used to implicitly convert an RDD to a DataFrame.
    -import sqlContext.implicits._
    +import spark.implicits._
     {% endhighlight %}
     
     </div>
     
     <div data-lang="java" markdown="1">
     
    -The entry point into all functionality in Spark SQL is the
    -[`SQLContext`](api/java/index.html#org.apache.spark.sql.SQLContext) class, or one of its
    -descendants. To create a basic `SQLContext`, all you need is a SparkContext.
    +The entry point into all functionality in Spark is the [`SparkSession`](api/java/index.html#org.apache.spark.sql.SparkSession) class. To create a basic `SparkSession`, just use `SparkSession.builder()`:
     
     {% highlight java %}
    -JavaSparkContext sc = ...; // An existing JavaSparkContext.
    -SQLContext sqlContext = new org.apache.spark.sql.SQLContext(sc);
    -{% endhighlight %}
    +import org.apache.spark.sql.SparkSession;
     
    +SparkSession spark = SparkSession.builder()
    +  .master("local")
    +  .appName("Word Count")
    +  .config("spark.some.config.option", "some-value")
    +  .getOrCreate();
    +{% endhighlight %}
     </div>
     
     <div data-lang="python"  markdown="1">
     
    -The entry point into all relational functionality in Spark is the
    -[`SQLContext`](api/python/pyspark.sql.html#pyspark.sql.SQLContext) class, or one
    -of its decedents. To create a basic `SQLContext`, all you need is a SparkContext.
    +The entry point into all functionality in Spark is the [`SparkSession`](api/python/pyspark.sql.html#pyspark.sql.SparkSession) class. To create a basic `SparkSession`, just use `SparkSession.builder`:
     
     {% highlight python %}
    -from pyspark.sql import SQLContext
    -sqlContext = SQLContext(sc)
    +from pyspark.sql import SparkSession
    +
    +spark = SparkSession.builder \
    +  .master("local") \
    +  .appName("Word Count") \
    +  .config("spark.some.config.option", "some-value") \
    +  .getOrCreate()
     {% endhighlight %}
     
     </div>
     
     <div data-lang="r"  markdown="1">
     
    -The entry point into all relational functionality in Spark is the
    -`SQLContext` class, or one of its decedents. To create a basic `SQLContext`, all you need is a SparkContext.
    +Unlike Scala, Java, and Python API, we haven't finished migrating `SQLContext` to `SparkSession` for SparkR yet, so
    +the entry point into all relational functionality in SparkR is still the
    +`SQLContext` class in Spark 2.0. To create a basic `SQLContext`, all you need is a `SparkContext`.
    --- End diff --
    
    This has in fact been changed and merged. The entry point is now `sparkR.session()`.




[GitHub] spark pull request #13592: [SPARK-15863][SQL][DOC] Initial SQL programming g...

Posted by clockfly <gi...@git.apache.org>.
Github user clockfly commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13592#discussion_r66891502
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -12,130 +12,130 @@ title: Spark SQL and DataFrames
     Spark SQL is a Spark module for structured data processing. Unlike the basic Spark RDD API, the interfaces provided
     by Spark SQL provide Spark with more information about the structure of both the data and the computation being performed. Internally,
     Spark SQL uses this extra information to perform extra optimizations. There are several ways to
    -interact with Spark SQL including SQL, the DataFrames API and the Datasets API. When computing a result
    +interact with Spark SQL including SQL and the Datasets API. When computing a result
    --- End diff --
    
    "Dataset API" instead of "Datasets API"?




[GitHub] spark pull request #13592: [SPARK-15863][SQL][DOC] Initial SQL programming g...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13592#discussion_r67763685
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -12,130 +12,129 @@ title: Spark SQL and DataFrames
     Spark SQL is a Spark module for structured data processing. Unlike the basic Spark RDD API, the interfaces provided
     by Spark SQL provide Spark with more information about the structure of both the data and the computation being performed. Internally,
     Spark SQL uses this extra information to perform extra optimizations. There are several ways to
    -interact with Spark SQL including SQL, the DataFrames API and the Datasets API. When computing a result
    +interact with Spark SQL including SQL and the Dataset API. When computing a result
     the same execution engine is used, independent of which API/language you are using to express the
    -computation. This unification means that developers can easily switch back and forth between the
    -various APIs based on which provides the most natural way to express a given transformation.
    +computation. This unification means that developers can easily switch back and forth between
    +different APIs based on which provides the most natural way to express a given transformation.
     
     All of the examples on this page use sample data included in the Spark distribution and can be run in
     the `spark-shell`, `pyspark` shell, or `sparkR` shell.
     
     ## SQL
     
    -One use of Spark SQL is to execute SQL queries written using either a basic SQL syntax or HiveQL.
    +One use of Spark SQL is to execute SQL queries.
     Spark SQL can also be used to read data from an existing Hive installation. For more on how to
     configure this feature, please refer to the [Hive Tables](#hive-tables) section. When running
    -SQL from within another programming language the results will be returned as a [DataFrame](#DataFrames).
    +SQL from within another programming language the results will be returned as a [DataFrame](#datasets-and-dataframes).
     You can also interact with the SQL interface using the [command-line](#running-the-spark-sql-cli)
     or over [JDBC/ODBC](#running-the-thrift-jdbcodbc-server).
     
    -## DataFrames
    +## Datasets and DataFrames
     
    -A DataFrame is a distributed collection of data organized into named columns. It is conceptually
    -equivalent to a table in a relational database or a data frame in R/Python, but with richer
    -optimizations under the hood. DataFrames can be constructed from a wide array of [sources](#data-sources) such
    -as: structured data files, tables in Hive, external databases, or existing RDDs.
    +A Dataset is a new interface added in Spark 1.6 that tries to provide the benefits of RDDs (strong
    --- End diff --
    
    `A Dataset is a new interface`? 
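
    Whatever wording is settled on, a small Scala sketch of the two claims in that sentence (strong typing plus lambda-style transformations on the optimized engine); the `Person` case class and the existing `spark` session are assumptions:

    ```scala
    import org.apache.spark.sql.{DataFrame, Dataset}
    import spark.implicits._ // encoders for case classes; assumes a SparkSession named `spark`

    case class Person(name: String, age: Long)

    // A strongly typed Dataset constructed from JVM objects...
    val people: Dataset[Person] = Seq(Person("Andy", 32), Person("Justin", 19)).toDS()

    // ...manipulated with lambda functions, yet executed by the Spark SQL engine.
    val names = people.filter(_.age > 20).map(_.name)

    // A DataFrame is simply a Dataset of Rows (Dataset[Row] in Scala).
    val df: DataFrame = people.toDF()
    ```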




[GitHub] spark issue #13592: [SPARK-15863][SQL][DOC] Initial SQL programming guide up...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on the issue:

    https://github.com/apache/spark/pull/13592
  
    @felixcheung Thanks for the review and your work on PR #13751! Was traveling during the weekend. Let's address these comments in follow-up PRs.




[GitHub] spark pull request #13592: [SPARK-15863][SQL][DOC] Initial SQL programming g...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13592#discussion_r67773057
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -12,130 +12,130 @@ title: Spark SQL and DataFrames
     Spark SQL is a Spark module for structured data processing. Unlike the basic Spark RDD API, the interfaces provided
     by Spark SQL provide Spark with more information about the structure of both the data and the computation being performed. Internally,
     Spark SQL uses this extra information to perform extra optimizations. There are several ways to
    -interact with Spark SQL including SQL, the DataFrames API and the Datasets API. When computing a result
    +interact with Spark SQL including SQL and the Datasets API. When computing a result
     the same execution engine is used, independent of which API/language you are using to express the
    -computation. This unification means that developers can easily switch back and forth between the
    -various APIs based on which provides the most natural way to express a given transformation.
    +computation. This unification means that developers can easily switch back and forth between
    +different APIs based on which provides the most natural way to express a given transformation.
     
     All of the examples on this page use sample data included in the Spark distribution and can be run in
     the `spark-shell`, `pyspark` shell, or `sparkR` shell.
     
     ## SQL
     
    -One use of Spark SQL is to execute SQL queries written using either a basic SQL syntax or HiveQL.
    +One use of Spark SQL is to execute SQL queries.
     Spark SQL can also be used to read data from an existing Hive installation. For more on how to
     configure this feature, please refer to the [Hive Tables](#hive-tables) section. When running
    -SQL from within another programming language the results will be returned as a [DataFrame](#DataFrames).
    +SQL from within another programming language the results will be returned as a [Dataset\[Row\]](#datasets).
     You can also interact with the SQL interface using the [command-line](#running-the-spark-sql-cli)
     or over [JDBC/ODBC](#running-the-thrift-jdbcodbc-server).
     
    -## DataFrames
    +## Datasets and DataFrames
     
    -A DataFrame is a distributed collection of data organized into named columns. It is conceptually
    -equivalent to a table in a relational database or a data frame in R/Python, but with richer
    -optimizations under the hood. DataFrames can be constructed from a wide array of [sources](#data-sources) such
    -as: structured data files, tables in Hive, external databases, or existing RDDs.
    +A Dataset is a new interface added in Spark 1.6 that tries to provide the benefits of RDDs (strong
    +typing, ability to use powerful lambda functions) with the benefits of Spark SQL's optimized
    +execution engine. A Dataset can be [constructed](#creating-datasets) from JVM objects and then
    +manipulated using functional transformations (map, flatMap, filter, etc.).
     
    -The DataFrame API is available in [Scala](api/scala/index.html#org.apache.spark.sql.DataFrame),
    -[Java](api/java/index.html?org/apache/spark/sql/DataFrame.html),
    -[Python](api/python/pyspark.sql.html#pyspark.sql.DataFrame), and [R](api/R/index.html).
    +The Dataset API is the successor of the DataFrame API, which was introduced in Spark 1.3. In Spark
    --- End diff --
    
    `the successor of the DataFrame API` sounds weird. 




[GitHub] spark issue #13592: [SPARK-15863][SQL][DOC] Initial SQL programming guide up...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13592
  
    **[Test build #60277 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60277/consoleFull)** for PR 13592 at commit [`92f3f11`](https://github.com/apache/spark/commit/92f3f11563934f5d3dd6233663ba3b77fe8bbc67).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request #13592: [SPARK-15863][SQL][DOC] Initial SQL programming g...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13592#discussion_r66566102
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -12,130 +12,121 @@ title: Spark SQL and DataFrames
     Spark SQL is a Spark module for structured data processing. Unlike the basic Spark RDD API, the interfaces provided
     by Spark SQL provide Spark with more information about the structure of both the data and the computation being performed. Internally,
     Spark SQL uses this extra information to perform extra optimizations. There are several ways to
    -interact with Spark SQL including SQL, the DataFrames API and the Datasets API. When computing a result
    +interact with Spark SQL including SQL and the Datasets API. When computing a result
     the same execution engine is used, independent of which API/language you are using to express the
    -computation. This unification means that developers can easily switch back and forth between the
    -various APIs based on which provides the most natural way to express a given transformation.
    +computation. This unification means that developers can easily switch back and forth between
    +different APIs based on which provides the most natural way to express a given transformation.
     
     All of the examples on this page use sample data included in the Spark distribution and can be run in
     the `spark-shell`, `pyspark` shell, or `sparkR` shell.
     
     ## SQL
     
    -One use of Spark SQL is to execute SQL queries written using either a basic SQL syntax or HiveQL.
    +One use of Spark SQL is to execute SQL queries.
     Spark SQL can also be used to read data from an existing Hive installation. For more on how to
     configure this feature, please refer to the [Hive Tables](#hive-tables) section. When running
    -SQL from within another programming language the results will be returned as a [DataFrame](#DataFrames).
    +SQL from within another programming language the results will be returned as a [Dataset\[Row\]](#datasets).
     You can also interact with the SQL interface using the [command-line](#running-the-spark-sql-cli)
     or over [JDBC/ODBC](#running-the-thrift-jdbcodbc-server).
     
    -## DataFrames
    +## Datasets and DataFrames
     
    -A DataFrame is a distributed collection of data organized into named columns. It is conceptually
    -equivalent to a table in a relational database or a data frame in R/Python, but with richer
    -optimizations under the hood. DataFrames can be constructed from a wide array of [sources](#data-sources) such
    -as: structured data files, tables in Hive, external databases, or existing RDDs.
    +A Dataset is a new interface added in Spark 1.6 that tries to provide the benefits of RDDs (strong
    +typing, ability to use powerful lambda functions) with the benefits of Spark SQL's optimized
    +execution engine. A Dataset can be [constructed](#creating-datasets) from JVM objects and then
    +manipulated using functional transformations (map, flatMap, filter, etc.).
     
    -The DataFrame API is available in [Scala](api/scala/index.html#org.apache.spark.sql.DataFrame),
    -[Java](api/java/index.html?org/apache/spark/sql/DataFrame.html),
    -[Python](api/python/pyspark.sql.html#pyspark.sql.DataFrame), and [R](api/R/index.html).
    +The Dataset API is the successor of the DataFrame API, which was introduced in Spark 1.3. In Spark
    +2.0, Datasets and DataFrames are unified, and DataFrames are now equivalent to Datasets of `Row`s.
    +In fact, `DataFrame` is simply a type alias of `Dataset[Row]` in [the Scala API][scala-datasets].
    +However, [Java API][java-datasets] users must use `Dataset<Row>` instead.
     
    -## Datasets
    +[scala-datasets]: api/scala/index.html#org.apache.spark.sql.Dataset
    +[java-datasets]: api/java/index.html?org/apache/spark/sql/Dataset.html
     
    -A Dataset is a new experimental interface added in Spark 1.6 that tries to provide the benefits of
    -RDDs (strong typing, ability to use powerful lambda functions) with the benefits of Spark SQL's
    -optimized execution engine. A Dataset can be [constructed](#creating-datasets) from JVM objects and then manipulated
    -using functional transformations (map, flatMap, filter, etc.).
    +Python does not yet have support for the Dataset API, but due to its dynamic nature many of the
    +benefits are already available (i.e. you can access the field of a row by name naturally
    +`row.columnName`). The case for R is similar.
     
    -The unified Dataset API can be used both in [Scala](api/scala/index.html#org.apache.spark.sql.Dataset) and
    -[Java](api/java/index.html?org/apache/spark/sql/Dataset.html). Python does not yet have support for
    -the Dataset API, but due to its dynamic nature many of the benefits are already available (i.e. you can
    -access the field of a row by name naturally `row.columnName`). Full python support will be added
    -in a future release.
    +Throughout this document, we will often refer to Scala/Java Datasets of `Row`s as DataFrames.
     
     # Getting Started
     
    -## Starting Point: SQLContext
    +## Starting Point: SparkSession
     
     <div class="codetabs">
     <div data-lang="scala"  markdown="1">
     
     The entry point into all functionality in Spark SQL is the
    -[`SQLContext`](api/scala/index.html#org.apache.spark.sql.SQLContext) class, or one of its
    -descendants. To create a basic `SQLContext`, all you need is a SparkContext.
    +[`SparkSession`](api/scala/index.html#org.apache.spark.sql.SparkSession) class. To create a basic `SparkSession`, all you need is a `SparkContext`.
    --- End diff --
    
    You don't need a SparkContext to create a session.
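
    For illustration, a minimal sketch (assuming Spark 2.0's `SparkSession.builder()` API) of creating a session directly, without constructing a `SparkContext` first:
    ```
    import org.apache.spark.sql.SparkSession

    // The builder creates (or reuses) the underlying SparkContext for you.
    val spark = SparkSession.builder()
      .master("local[*]")              // optional when launched via spark-submit
      .appName("SessionWithoutContext")
      .getOrCreate()

    // If a SparkContext is still needed, it is reachable from the session.
    val sc = spark.sparkContext
    ```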




[GitHub] spark pull request #13592: [SPARK-15863][SQL][DOC] Initial SQL programming g...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/13592




[GitHub] spark issue #13592: [SPARK-15863][SQL][DOC] Initial SQL programming guide up...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13592
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request #13592: [SPARK-15863][SQL][DOC] Initial SQL programming g...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13592#discussion_r66877088
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -517,24 +517,26 @@ types such as Sequences or Arrays. This RDD can be implicitly converted to a Dat
     registered as a table. Tables can be used in subsequent SQL statements.
     
     {% highlight scala %}
    -// sc is an existing SparkContext.
    -val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    +val spark: SparkSession // An existing SparkSession
     // this is used to implicitly convert an RDD to a DataFrame.
    -import sqlContext.implicits._
    +import spark.implicits._
     
     // Define the schema using a case class.
     // Note: Case classes in Scala 2.10 can support only up to 22 fields. To work around this limit,
     // you can use custom classes that implement the Product interface.
     case class Person(name: String, age: Int)
     
    -// Create an RDD of Person objects and register it as a table.
    -val people = sc.textFile("examples/src/main/resources/people.txt").map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt)).toDF()
    +// Create an RDD of Person objects and register it as a temporary view.
    +val people = sc
    +  .textFile("examples/src/main/resources/people.txt")
    +  .map(_.split(","))
    +  .map(p => Person(p(0), p(1).trim.toInt))
    +  .toDF()
    --- End diff --
    
    There is no reflection anymore; we now always use the type `T` to create the encoder and serialize the object.
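
    A hedged sketch of what this looks like in user code, assuming Spark 2.0's `spark.implicits._` encoders (names here are illustrative):
    ```
    import org.apache.spark.sql.SparkSession

    case class Person(name: String, age: Int)

    val spark = SparkSession.builder().appName("EncoderSketch").getOrCreate()
    // Brings Encoder[Person] into scope; the encoder is derived from the static
    // type, not from runtime reflection over each row.
    import spark.implicits._

    val people = spark.sparkContext
      .textFile("examples/src/main/resources/people.txt")
      .map(_.split(","))
      .map(p => Person(p(0), p(1).trim.toInt))
      .toDS()   // Dataset[Person]; .toDF() gives the untyped DataFrame view
    ```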




[GitHub] spark pull request #13592: [SPARK-15863][SQL][DOC] Initial SQL programming g...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13592#discussion_r67203021
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -956,30 +954,30 @@ file directly with SQL.
     <div data-lang="scala"  markdown="1">
     
     {% highlight scala %}
    -val df = sqlContext.sql("SELECT * FROM parquet.`examples/src/main/resources/users.parquet`")
    +val df = spark.sql("SELECT * FROM parquet.`examples/src/main/resources/users.parquet`")
     {% endhighlight %}
     
     </div>
     
     <div data-lang="java"  markdown="1">
     
     {% highlight java %}
    -DataFrame df = sqlContext.sql("SELECT * FROM parquet.`examples/src/main/resources/users.parquet`");
    +Dataset<Row> df = spark.sql("SELECT * FROM parquet.`examples/src/main/resources/users.parquet`");
     {% endhighlight %}
     </div>
     
     <div data-lang="python"  markdown="1">
     
     {% highlight python %}
    -df = sqlContext.sql("SELECT * FROM parquet.`examples/src/main/resources/users.parquet`")
    +df = spark.sql("SELECT * FROM parquet.`examples/src/main/resources/users.parquet`")
     {% endhighlight %}
     
     </div>
     
     <div data-lang="r"  markdown="1">
     
     {% highlight r %}
    -df <- sql(sqlContext, "SELECT * FROM parquet.`examples/src/main/resources/users.parquet`")
    +df <- sql(spark, "SELECT * FROM parquet.`examples/src/main/resources/users.parquet`")
    --- End diff --
    
    The same.
    ```
    df <- sql("SELECT * FROM parquet.`examples/src/main/resources/users.parquet`")
    ```




[GitHub] spark pull request #13592: [SPARK-15863][SQL][DOC] Initial SQL programming g...

Posted by WeichenXu123 <gi...@git.apache.org>.
Github user WeichenXu123 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13592#discussion_r66699400
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -1607,13 +1600,13 @@ a regular multi-line JSON file will most often fail.
     
     {% highlight r %}
     # sc is an existing SparkContext.
    -sqlContext <- sparkRSQL.init(sc)
    +spark <- sparkRSQL.init(sc)
    --- End diff --
    
    Currently, `sparkRSQL.init` calls `org.apache.spark.sql.api.r.SQLUtils.createSQLContext`, which returns a `SQLContext` object, not a `SparkSession` object. So it seems the R API needs to be updated here?




[GitHub] spark pull request #13592: [SPARK-15863][SQL][DOC] Initial SQL programming g...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13592#discussion_r66566143
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -184,20 +175,20 @@ showDF(df)
     </div>
     
     
    -## DataFrame Operations
    +## Untyped Dataset Operations
    --- End diff --
    
    Untyped Dataset Operations (aka DataFrame operations)
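
    As a hedged illustration of the kind of untyped (DataFrame) operations that section covers, assuming the sample `people.json` shipped with Spark and an existing `SparkSession` named `spark`:
    ```
    val df = spark.read.json("examples/src/main/resources/people.json")

    df.printSchema()                    // inspect the inferred schema
    df.select("name").show()            // column projection by name
    df.filter(df("age") > 21).show()    // untyped filtering on a Column expression
    df.groupBy("age").count().show()    // aggregation
    ```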




[GitHub] spark pull request #13592: [SPARK-15863][SQL][DOC] Initial SQL programming g...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13592#discussion_r66700277
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -1607,13 +1600,13 @@ a regular multi-line JSON file will most often fail.
     
     {% highlight r %}
     # sc is an existing SparkContext.
    -sqlContext <- sparkRSQL.init(sc)
    +spark <- sparkRSQL.init(sc)
    --- End diff --
    
    The R API is still experimental, and we haven't introduced `SparkSession` to SparkR yet.




[GitHub] spark pull request #13592: [SPARK-15863][SQL][DOC] Initial SQL programming g...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13592#discussion_r66884213
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -12,130 +12,130 @@ title: Spark SQL and DataFrames
     Spark SQL is a Spark module for structured data processing. Unlike the basic Spark RDD API, the interfaces provided
     by Spark SQL provide Spark with more information about the structure of both the data and the computation being performed. Internally,
     Spark SQL uses this extra information to perform extra optimizations. There are several ways to
    -interact with Spark SQL including SQL, the DataFrames API and the Datasets API. When computing a result
    +interact with Spark SQL including SQL and the Datasets API. When computing a result
     the same execution engine is used, independent of which API/language you are using to express the
    -computation. This unification means that developers can easily switch back and forth between the
    -various APIs based on which provides the most natural way to express a given transformation.
    +computation. This unification means that developers can easily switch back and forth between
    +different APIs based on which provides the most natural way to express a given transformation.
     
     All of the examples on this page use sample data included in the Spark distribution and can be run in
     the `spark-shell`, `pyspark` shell, or `sparkR` shell.
     
     ## SQL
     
    -One use of Spark SQL is to execute SQL queries written using either a basic SQL syntax or HiveQL.
    +One use of Spark SQL is to execute SQL queries.
     Spark SQL can also be used to read data from an existing Hive installation. For more on how to
     configure this feature, please refer to the [Hive Tables](#hive-tables) section. When running
    -SQL from within another programming language the results will be returned as a [DataFrame](#DataFrames).
    +SQL from within another programming language the results will be returned as a [Dataset\[Row\]](#datasets).
     You can also interact with the SQL interface using the [command-line](#running-the-spark-sql-cli)
     or over [JDBC/ODBC](#running-the-thrift-jdbcodbc-server).
     
    -## DataFrames
    +## Datasets and DataFrames
     
    -A DataFrame is a distributed collection of data organized into named columns. It is conceptually
    -equivalent to a table in a relational database or a data frame in R/Python, but with richer
    -optimizations under the hood. DataFrames can be constructed from a wide array of [sources](#data-sources) such
    -as: structured data files, tables in Hive, external databases, or existing RDDs.
    +A Dataset is a new interface added in Spark 1.6 that tries to provide the benefits of RDDs (strong
    +typing, ability to use powerful lambda functions) with the benefits of Spark SQL's optimized
    +execution engine. A Dataset can be [constructed](#creating-datasets) from JVM objects and then
    +manipulated using functional transformations (map, flatMap, filter, etc.).
     
    -The DataFrame API is available in [Scala](api/scala/index.html#org.apache.spark.sql.DataFrame),
    -[Java](api/java/index.html?org/apache/spark/sql/DataFrame.html),
    -[Python](api/python/pyspark.sql.html#pyspark.sql.DataFrame), and [R](api/R/index.html).
    +The Dataset API is the successor of the DataFrame API, which was introduced in Spark 1.3. In Spark
    +2.0, Datasets and DataFrames are unified, and DataFrames are now equivalent to Datasets of `Row`s.
    +In fact, `DataFrame` is simply a type alias of `Dataset[Row]` in [the Scala API][scala-datasets].
    +However, [Java API][java-datasets] users must use `Dataset<Row>` instead.
     
    -## Datasets
    +[scala-datasets]: api/scala/index.html#org.apache.spark.sql.Dataset
    +[java-datasets]: api/java/index.html?org/apache/spark/sql/Dataset.html
     
    -A Dataset is a new experimental interface added in Spark 1.6 that tries to provide the benefits of
    -RDDs (strong typing, ability to use powerful lambda functions) with the benefits of Spark SQL's
    -optimized execution engine. A Dataset can be [constructed](#creating-datasets) from JVM objects and then manipulated
    -using functional transformations (map, flatMap, filter, etc.).
    +Python does not have support for the Dataset API, but due to its dynamic nature many of the
    +benefits are already available (i.e. you can access the field of a row by name naturally
    +`row.columnName`). The case for R is similar.
     
    -The unified Dataset API can be used both in [Scala](api/scala/index.html#org.apache.spark.sql.Dataset) and
    -[Java](api/java/index.html?org/apache/spark/sql/Dataset.html). Python does not yet have support for
    -the Dataset API, but due to its dynamic nature many of the benefits are already available (i.e. you can
    -access the field of a row by name naturally `row.columnName`). Full python support will be added
    -in a future release.
    +Throughout this document, we will often refer to Scala/Java Datasets of `Row`s as DataFrames.
     
     # Getting Started
     
    -## Starting Point: SQLContext
    +## Starting Point: SparkSession
     
     <div class="codetabs">
     <div data-lang="scala"  markdown="1">
     
    -The entry point into all functionality in Spark SQL is the
    -[`SQLContext`](api/scala/index.html#org.apache.spark.sql.SQLContext) class, or one of its
    -descendants. To create a basic `SQLContext`, all you need is a SparkContext.
    +The entry point into all functionality in Spark SQL is the [`SparkSession`](api/scala/index.html#org.apache.spark.sql.SparkSession) class. To create a basic `SparkSession`, just use `SparkSession.builder()`:
     
     {% highlight scala %}
    -val sc: SparkContext // An existing SparkContext.
    -val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    +import org.apache.spark.sql.SparkSession
    +
    +val spark = SparkSession.build()
    +  .master("local")
    +  .appName("Word Count")
    +  .config("spark.some.config.option", "some-value")
    +  .getOrCreate()
     
     // this is used to implicitly convert an RDD to a DataFrame.
    -import sqlContext.implicits._
    +import spark.implicits._
     {% endhighlight %}
     
     </div>
     
     <div data-lang="java" markdown="1">
     
    -The entry point into all functionality in Spark SQL is the
    -[`SQLContext`](api/java/index.html#org.apache.spark.sql.SQLContext) class, or one of its
    -descendants. To create a basic `SQLContext`, all you need is a SparkContext.
    +The entry point into all functionality in Spark SQL is the [`SparkSession`](api/java/index.html#org.apache.spark.sql.SparkSession) class. To create a basic `SparkSession`, just use `SparkSession.build()`:
     
     {% highlight java %}
    -JavaSparkContext sc = ...; // An existing JavaSparkContext.
    -SQLContext sqlContext = new org.apache.spark.sql.SQLContext(sc);
    -{% endhighlight %}
    +import org.apache.spark.sql.SparkSession
     
    +SparkSession spark = SparkSession.build()
    +  .master("local")
    +  .appName("Word Count")
    +  .config("spark.some.config.option", "some-value")
    +  .getOrCreate();
    +{% endhighlight %}
     </div>
     
     <div data-lang="python"  markdown="1">
     
    -The entry point into all relational functionality in Spark is the
    -[`SQLContext`](api/python/pyspark.sql.html#pyspark.sql.SQLContext) class, or one
    -of its decedents. To create a basic `SQLContext`, all you need is a SparkContext.
    +The entry point into all relational functionality in Spark is the [`SparkSession`](api/python/pyspark.sql.html#pyspark.sql.SparkSession) class. To create a basic `SparkSession`, just use `SparkSession.build`:
     
     {% highlight python %}
    -from pyspark.sql import SQLContext
    -sqlContext = SQLContext(sc)
    +from pyspark.sql import SparkSession
    +spark = SparkSession.build \
    +  .master("local") \
    +  .appName("Word Count") \
    +  .config("spark.some.config.option", "some-value") \
    +  .getOrCreate()
     {% endhighlight %}
     
     </div>
     
     <div data-lang="r"  markdown="1">
     
     The entry point into all relational functionality in Spark is the
    -`SQLContext` class, or one of its decedents. To create a basic `SQLContext`, all you need is a SparkContext.
    +`SparkSession` class. To create a basic `SparkSession`, all you need is a `SparkContext`.
    --- End diff --
    
    Good catch, thanks!




[GitHub] spark pull request #13592: [SPARK-15863][SQL][DOC] Initial SQL programming g...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13592#discussion_r67202506
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -363,10 +363,10 @@ In addition to simple column references and expressions, DataFrames also have a
     
     <div data-lang="r"  markdown="1">
     {% highlight r %}
    -sqlContext <- sparkRSQL.init(sc)
    +spark <- sparkRSQL.init(sc)
     
     # Create the DataFrame
    -df <- read.json(sqlContext, "examples/src/main/resources/people.json")
    +df <- read.json(spark, "examples/src/main/resources/people.json")
    --- End diff --
    
    We can remove the following:
    ```
    spark <- sparkRSQL.init(sc)
    ```
    And use the following:
    ```
    df <- read.json("examples/src/main/resources/people.json")
    ```




[GitHub] spark pull request #13592: [SPARK-15863][SQL][DOC] Initial SQL programming g...

Posted by clockfly <gi...@git.apache.org>.
Github user clockfly commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13592#discussion_r66894132
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -12,130 +12,130 @@ title: Spark SQL and DataFrames
     Spark SQL is a Spark module for structured data processing. Unlike the basic Spark RDD API, the interfaces provided
     by Spark SQL provide Spark with more information about the structure of both the data and the computation being performed. Internally,
     Spark SQL uses this extra information to perform extra optimizations. There are several ways to
    -interact with Spark SQL including SQL, the DataFrames API and the Datasets API. When computing a result
    +interact with Spark SQL including SQL and the Datasets API. When computing a result
     the same execution engine is used, independent of which API/language you are using to express the
    -computation. This unification means that developers can easily switch back and forth between the
    -various APIs based on which provides the most natural way to express a given transformation.
    +computation. This unification means that developers can easily switch back and forth between
    +different APIs based on which provides the most natural way to express a given transformation.
     
     All of the examples on this page use sample data included in the Spark distribution and can be run in
     the `spark-shell`, `pyspark` shell, or `sparkR` shell.
     
     ## SQL
     
    -One use of Spark SQL is to execute SQL queries written using either a basic SQL syntax or HiveQL.
    +One use of Spark SQL is to execute SQL queries.
     Spark SQL can also be used to read data from an existing Hive installation. For more on how to
     configure this feature, please refer to the [Hive Tables](#hive-tables) section. When running
    -SQL from within another programming language the results will be returned as a [DataFrame](#DataFrames).
    +SQL from within another programming language the results will be returned as a [Dataset\[Row\]](#datasets).
     You can also interact with the SQL interface using the [command-line](#running-the-spark-sql-cli)
     or over [JDBC/ODBC](#running-the-thrift-jdbcodbc-server).
     
    -## DataFrames
    +## Datasets and DataFrames
     
    -A DataFrame is a distributed collection of data organized into named columns. It is conceptually
    -equivalent to a table in a relational database or a data frame in R/Python, but with richer
    -optimizations under the hood. DataFrames can be constructed from a wide array of [sources](#data-sources) such
    -as: structured data files, tables in Hive, external databases, or existing RDDs.
    +A Dataset is a new interface added in Spark 1.6 that tries to provide the benefits of RDDs (strong
    +typing, ability to use powerful lambda functions) with the benefits of Spark SQL's optimized
    +execution engine. A Dataset can be [constructed](#creating-datasets) from JVM objects and then
    +manipulated using functional transformations (map, flatMap, filter, etc.).
     
    -The DataFrame API is available in [Scala](api/scala/index.html#org.apache.spark.sql.DataFrame),
    -[Java](api/java/index.html?org/apache/spark/sql/DataFrame.html),
    -[Python](api/python/pyspark.sql.html#pyspark.sql.DataFrame), and [R](api/R/index.html).
    +The Dataset API is the successor of the DataFrame API, which was introduced in Spark 1.3. In Spark
    +2.0, Datasets and DataFrames are unified, and DataFrames are now equivalent to Datasets of `Row`s.
    +In fact, `DataFrame` is simply a type alias of `Dataset[Row]` in [the Scala API][scala-datasets].
    +However, [Java API][java-datasets] users must use `Dataset<Row>` instead.
     
    -## Datasets
    +[scala-datasets]: api/scala/index.html#org.apache.spark.sql.Dataset
    +[java-datasets]: api/java/index.html?org/apache/spark/sql/Dataset.html
     
    -A Dataset is a new experimental interface added in Spark 1.6 that tries to provide the benefits of
    -RDDs (strong typing, ability to use powerful lambda functions) with the benefits of Spark SQL's
    -optimized execution engine. A Dataset can be [constructed](#creating-datasets) from JVM objects and then manipulated
    -using functional transformations (map, flatMap, filter, etc.).
    +Python does not have support for the Dataset API, but due to its dynamic nature many of the
    +benefits are already available (i.e. you can access the field of a row by name naturally
    +`row.columnName`). The case for R is similar.
     
    -The unified Dataset API can be used both in [Scala](api/scala/index.html#org.apache.spark.sql.Dataset) and
    -[Java](api/java/index.html?org/apache/spark/sql/Dataset.html). Python does not yet have support for
    -the Dataset API, but due to its dynamic nature many of the benefits are already available (i.e. you can
    -access the field of a row by name naturally `row.columnName`). Full python support will be added
    -in a future release.
    +Throughout this document, we will often refer to Scala/Java Datasets of `Row`s as DataFrames.
     
     # Getting Started
     
    -## Starting Point: SQLContext
    +## Starting Point: SparkSession
     
     <div class="codetabs">
     <div data-lang="scala"  markdown="1">
     
    -The entry point into all functionality in Spark SQL is the
    -[`SQLContext`](api/scala/index.html#org.apache.spark.sql.SQLContext) class, or one of its
    -descendants. To create a basic `SQLContext`, all you need is a SparkContext.
    +The entry point into all functionality in Spark SQL is the [`SparkSession`](api/scala/index.html#org.apache.spark.sql.SparkSession) class. To create a basic `SparkSession`, just use `SparkSession.builder()`:
     
     {% highlight scala %}
    -val sc: SparkContext // An existing SparkContext.
    -val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    +import org.apache.spark.sql.SparkSession
    +
    +val spark = SparkSession.build()
    +  .master("local")
    +  .appName("Word Count")
    +  .config("spark.some.config.option", "some-value")
    +  .getOrCreate()
     
     // this is used to implicitly convert an RDD to a DataFrame.
    -import sqlContext.implicits._
    +import spark.implicits._
     {% endhighlight %}
     
     </div>
     
     <div data-lang="java" markdown="1">
     
    -The entry point into all functionality in Spark SQL is the
    -[`SQLContext`](api/java/index.html#org.apache.spark.sql.SQLContext) class, or one of its
    -descendants. To create a basic `SQLContext`, all you need is a SparkContext.
    +The entry point into all functionality in Spark SQL is the [`SparkSession`](api/java/index.html#org.apache.spark.sql.SparkSession) class. To create a basic `SparkSession`, just use `SparkSession.build()`:
     
     {% highlight java %}
    -JavaSparkContext sc = ...; // An existing JavaSparkContext.
    -SQLContext sqlContext = new org.apache.spark.sql.SQLContext(sc);
    -{% endhighlight %}
    +import org.apache.spark.sql.SparkSession
     
    +SparkSession spark = SparkSession.build()
    +  .master("local")
    +  .appName("Word Count")
    +  .config("spark.some.config.option", "some-value")
    +  .getOrCreate();
    +{% endhighlight %}
     </div>
     
     <div data-lang="python"  markdown="1">
     
    -The entry point into all relational functionality in Spark is the
    -[`SQLContext`](api/python/pyspark.sql.html#pyspark.sql.SQLContext) class, or one
    -of its decedents. To create a basic `SQLContext`, all you need is a SparkContext.
    +The entry point into all relational functionality in Spark is the [`SparkSession`](api/python/pyspark.sql.html#pyspark.sql.SparkSession) class. To create a basic `SparkSession`, just use `SparkSession.build`:
     
     {% highlight python %}
    -from pyspark.sql import SQLContext
    -sqlContext = SQLContext(sc)
    +from pyspark.sql import SparkSession
    +spark = SparkSession.build \
    +  .master("local") \
    +  .appName("Word Count") \
    +  .config("spark.some.config.option", "some-value") \
    +  .getOrCreate()
     {% endhighlight %}
     
     </div>
     
     <div data-lang="r"  markdown="1">
     
     The entry point into all relational functionality in Spark is the
    -`SQLContext` class, or one of its decedents. To create a basic `SQLContext`, all you need is a SparkContext.
    +`SparkSession` class. To create a basic `SparkSession`, all you need is a `SparkContext`.
     
     {% highlight r %}
    -sqlContext <- sparkRSQL.init(sc)
    +spark <- sparkRSQL.init(sc)
     {% endhighlight %}
     
     </div>
     </div>
     
    -In addition to the basic `SQLContext`, you can also create a `HiveContext`, which provides a
    -superset of the functionality provided by the basic `SQLContext`. Additional features include
    -the ability to write queries using the more complete HiveQL parser, access to Hive UDFs, and the
    -ability to read data from Hive tables. To use a `HiveContext`, you do not need to have an
    -existing Hive setup, and all of the data sources available to a `SQLContext` are still available.
    -`HiveContext` is only packaged separately to avoid including all of Hive's dependencies in the default
    -Spark build. If these dependencies are not a problem for your application then using `HiveContext`
    -is recommended for the 1.3 release of Spark. Future releases will focus on bringing `SQLContext` up
    -to feature parity with a `HiveContext`.
    +`SparkSession` in Spark 2.0 provides builtin support for Hive features including the ability to
    +write queries using HiveQL, access to Hive UDFs, and the ability to read data from Hive tables.
    +To use these features, you do not need to have an existing Hive setup, and all of the data sources
    +available to a `SparkSession` are still available. These Hive features are only packaged separately
    +to avoid including all of Hive's dependencies in the default Spark build.
     
     
     ## Creating DataFrames
     
    -With a `SQLContext`, applications can create `DataFrame`s from an <a href='#interoperating-with-rdds'>existing `RDD`</a>, from a Hive table, or from <a href='#data-sources'>data sources</a>.
    +With a `SparkSession`, applications can create DataFrames from an <a href='#interoperating-with-rdds'>existing `RDD`</a>,
    --- End diff --
    
    Are we using a different link style here? I see another style like `[text](link)` elsewhere.




[GitHub] spark issue #13592: [SPARK-15863][SQL][DOC] Initial SQL programming guide up...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13592
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/60664/
    Test PASSed.




[GitHub] spark issue #13592: [SPARK-15863][SQL][DOC] Initial SQL programming guide up...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13592
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/60637/
    Test PASSed.




[GitHub] spark issue #13592: [SPARK-15863][SQL][DOC] Initial SQL programming guide up...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13592
  
    **[Test build #60637 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60637/consoleFull)** for PR 13592 at commit [`200a68c`](https://github.com/apache/spark/commit/200a68c9fb376ada02205e71b218f0905103b0cb).




[GitHub] spark pull request #13592: [SPARK-15863][SQL][DOC] Initial SQL programming g...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13592#discussion_r67775004
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -12,130 +12,129 @@ title: Spark SQL and DataFrames
     Spark SQL is a Spark module for structured data processing. Unlike the basic Spark RDD API, the interfaces provided
     by Spark SQL provide Spark with more information about the structure of both the data and the computation being performed. Internally,
     Spark SQL uses this extra information to perform extra optimizations. There are several ways to
    -interact with Spark SQL including SQL, the DataFrames API and the Datasets API. When computing a result
    +interact with Spark SQL including SQL and the Dataset API. When computing a result
     the same execution engine is used, independent of which API/language you are using to express the
    -computation. This unification means that developers can easily switch back and forth between the
    -various APIs based on which provides the most natural way to express a given transformation.
    +computation. This unification means that developers can easily switch back and forth between
    +different APIs based on which provides the most natural way to express a given transformation.
     
     All of the examples on this page use sample data included in the Spark distribution and can be run in
     the `spark-shell`, `pyspark` shell, or `sparkR` shell.
     
     ## SQL
     
    -One use of Spark SQL is to execute SQL queries written using either a basic SQL syntax or HiveQL.
    +One use of Spark SQL is to execute SQL queries.
     Spark SQL can also be used to read data from an existing Hive installation. For more on how to
     configure this feature, please refer to the [Hive Tables](#hive-tables) section. When running
    -SQL from within another programming language the results will be returned as a [DataFrame](#DataFrames).
    +SQL from within another programming language the results will be returned as a [DataFrame](#datasets-and-dataframes).
     You can also interact with the SQL interface using the [command-line](#running-the-spark-sql-cli)
     or over [JDBC/ODBC](#running-the-thrift-jdbcodbc-server).
     
    -## DataFrames
    +## Datasets and DataFrames
     
    -A DataFrame is a distributed collection of data organized into named columns. It is conceptually
    -equivalent to a table in a relational database or a data frame in R/Python, but with richer
    -optimizations under the hood. DataFrames can be constructed from a wide array of [sources](#data-sources) such
    -as: structured data files, tables in Hive, external databases, or existing RDDs.
    +A Dataset is a new interface added in Spark 1.6 that tries to provide the benefits of RDDs (strong
    +typing, ability to use powerful lambda functions) with the benefits of Spark SQL's optimized
    +execution engine. A Dataset can be [constructed](#creating-datasets) from JVM objects and then
    +manipulated using functional transformations (`map`, `flatMap`, `filter`, etc.).
     
    -The DataFrame API is available in [Scala](api/scala/index.html#org.apache.spark.sql.DataFrame),
    -[Java](api/java/index.html?org/apache/spark/sql/DataFrame.html),
    -[Python](api/python/pyspark.sql.html#pyspark.sql.DataFrame), and [R](api/R/index.html).
    +The Dataset API is the successor of the DataFrame API, which was introduced in Spark 1.3. In Spark
    +2.0, Datasets and DataFrames are unified, and DataFrames are now equivalent to Datasets of `Row`s.
    +In fact, `DataFrame` is simply a type alias of `Dataset[Row]` in [the Scala API][scala-datasets].
    +However, [Java API][java-datasets] users must use `Dataset<Row>` instead.
     
    -## Datasets
    +[scala-datasets]: api/scala/index.html#org.apache.spark.sql.Dataset
    +[java-datasets]: api/java/index.html?org/apache/spark/sql/Dataset.html
     
    -A Dataset is a new experimental interface added in Spark 1.6 that tries to provide the benefits of
    -RDDs (strong typing, ability to use powerful lambda functions) with the benefits of Spark SQL's
    -optimized execution engine. A Dataset can be [constructed](#creating-datasets) from JVM objects and then manipulated
    -using functional transformations (map, flatMap, filter, etc.).
    +Python does not have support for the Dataset API, but due to its dynamic nature many of the
    +benefits are already available (i.e. you can access the field of a row by name naturally
    +`row.columnName`). The case for R is similar.
     
    -The unified Dataset API can be used both in [Scala](api/scala/index.html#org.apache.spark.sql.Dataset) and
    -[Java](api/java/index.html?org/apache/spark/sql/Dataset.html). Python does not yet have support for
    -the Dataset API, but due to its dynamic nature many of the benefits are already available (i.e. you can
    -access the field of a row by name naturally `row.columnName`). Full python support will be added
    -in a future release.
    +Throughout this document, we will often refer to Scala/Java Datasets of `Row`s as DataFrames.
     
     # Getting Started
     
    -## Starting Point: SQLContext
    +## Starting Point: SparkSession
    --- End diff --
    
    Never mind, just saw https://github.com/apache/spark/pull/13592/files#r67746855




[GitHub] spark pull request #13592: [SPARK-15863][SQL][DOC] Initial SQL programming g...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13592#discussion_r66875336
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -171,9 +171,9 @@ df.show()
     
     <div data-lang="r"  markdown="1">
     {% highlight r %}
    -sqlContext <- SQLContext(sc)
    +spark <- SparkSession(sc)
    --- End diff --
    
    SparkR doesn't have SparkSession




[GitHub] spark pull request #13592: [SPARK-15863][SQL][DOC] Initial SQL programming g...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13592#discussion_r66566055
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -12,130 +12,121 @@ title: Spark SQL and DataFrames
     Spark SQL is a Spark module for structured data processing. Unlike the basic Spark RDD API, the interfaces provided
     by Spark SQL provide Spark with more information about the structure of both the data and the computation being performed. Internally,
     Spark SQL uses this extra information to perform extra optimizations. There are several ways to
    -interact with Spark SQL including SQL, the DataFrames API and the Datasets API. When computing a result
    +interact with Spark SQL including SQL and the Datasets API. When computing a result
     the same execution engine is used, independent of which API/language you are using to express the
    -computation. This unification means that developers can easily switch back and forth between the
    -various APIs based on which provides the most natural way to express a given transformation.
    +computation. This unification means that developers can easily switch back and forth between
    +different APIs based on which provides the most natural way to express a given transformation.
     
     All of the examples on this page use sample data included in the Spark distribution and can be run in
     the `spark-shell`, `pyspark` shell, or `sparkR` shell.
     
     ## SQL
     
    -One use of Spark SQL is to execute SQL queries written using either a basic SQL syntax or HiveQL.
    +One use of Spark SQL is to execute SQL queries.
     Spark SQL can also be used to read data from an existing Hive installation. For more on how to
     configure this feature, please refer to the [Hive Tables](#hive-tables) section. When running
    -SQL from within another programming language the results will be returned as a [DataFrame](#DataFrames).
    +SQL from within another programming language the results will be returned as a [Dataset\[Row\]](#datasets).
     You can also interact with the SQL interface using the [command-line](#running-the-spark-sql-cli)
     or over [JDBC/ODBC](#running-the-thrift-jdbcodbc-server).
     
    -## DataFrames
    +## Datasets and DataFrames
     
    -A DataFrame is a distributed collection of data organized into named columns. It is conceptually
    -equivalent to a table in a relational database or a data frame in R/Python, but with richer
    -optimizations under the hood. DataFrames can be constructed from a wide array of [sources](#data-sources) such
    -as: structured data files, tables in Hive, external databases, or existing RDDs.
    +A Dataset is a new interface added in Spark 1.6 that tries to provide the benefits of RDDs (strong
    +typing, ability to use powerful lambda functions) with the benefits of Spark SQL's optimized
    +execution engine. A Dataset can be [constructed](#creating-datasets) from JVM objects and then
    +manipulated using functional transformations (map, flatMap, filter, etc.).
     
    -The DataFrame API is available in [Scala](api/scala/index.html#org.apache.spark.sql.DataFrame),
    -[Java](api/java/index.html?org/apache/spark/sql/DataFrame.html),
    -[Python](api/python/pyspark.sql.html#pyspark.sql.DataFrame), and [R](api/R/index.html).
    +The Dataset API is the successor of the DataFrame API, which was introduced in Spark 1.3. In Spark
    +2.0, Datasets and DataFrames are unified, and DataFrames are now equivalent to Datasets of `Row`s.
    +In fact, `DataFrame` is simply a type alias of `Dataset[Row]` in [the Scala API][scala-datasets].
    +However, [Java API][java-datasets] users must use `Dataset<Row>` instead.
     
    -## Datasets
    +[scala-datasets]: api/scala/index.html#org.apache.spark.sql.Dataset
    +[java-datasets]: api/java/index.html?org/apache/spark/sql/Dataset.html
     
    -A Dataset is a new experimental interface added in Spark 1.6 that tries to provide the benefits of
    -RDDs (strong typing, ability to use powerful lambda functions) with the benefits of Spark SQL's
    -optimized execution engine. A Dataset can be [constructed](#creating-datasets) from JVM objects and then manipulated
    -using functional transformations (map, flatMap, filter, etc.).
    +Python does not yet have support for the Dataset API, but due to its dynamic nature many of the
    --- End diff --
    
    remove "yet"




[GitHub] spark issue #13592: [SPARK-15863][SQL][DOC] Initial SQL programming guide up...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on the issue:

    https://github.com/apache/spark/pull/13592
  
    Could you please also update docs/sparkr.md, which cross-links to, e.g., sql-programming-guide.html#starting-point-sqlcontext?




[GitHub] spark pull request #13592: [SPARK-15863][SQL][DOC] Initial SQL programming g...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13592#discussion_r66878387
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -1650,14 +1646,15 @@ SELECT * FROM jsonTable
     ## Hive Tables
     
     Spark SQL also supports reading and writing data stored in [Apache Hive](http://hive.apache.org/).
    -However, since Hive has a large number of dependencies, it is not included in the default Spark assembly.
    -Hive support is enabled by adding the `-Phive` and `-Phive-thriftserver` flags to Spark's build.
    -This command builds a new assembly directory that includes Hive. Note that this Hive assembly directory must also be present
    -on all of the worker nodes, as they will need access to the Hive serialization and deserialization libraries
    -(SerDes) in order to access data stored in Hive.
    +However, since Hive has a large number of dependencies, these dependencies are not included in the
    +default Spark distribution. If Hive dependencies can be found on the classpath, Spark will load them
    +automatically. Note that these Hive dependencies must also be present on all of the worker nodes, as
    +they will need access to the Hive serialization and deserialization libraries (SerDes) in order to
    +access data stored in Hive.
     
    -Configuration of Hive is done by placing your `hive-site.xml`, `core-site.xml` (for security configuration),
    -`hdfs-site.xml` (for HDFS configuration) file in `conf/`.
    +Configuration of Hive is done by placing your `core-site.xml` (for security configuration),
    +`hdfs-site.xml` (for HDFS configuration) file in `conf/`, and adding configurations in your
    +`hive-site.xml` into `conf/spark-defaults.conf`.
    --- End diff --
    
    This will soon no longer be true; users will only need to put `hive-site.xml` on the classpath.
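
    For context, a minimal sketch of a Hive-enabled session created through the
    2.0 builder API (assuming the Hive dependencies and `hive-site.xml` are already
    on the classpath; this uses `SparkSession.builder()` rather than the `build()`
    spelling in the current draft):

    ```scala
    import org.apache.spark.sql.SparkSession

    // Hive support is enabled on the session builder; if hive-site.xml is on the
    // classpath, its settings are picked up automatically.
    val spark = SparkSession.builder()
      .appName("Hive example")
      .enableHiveSupport()
      .getOrCreate()

    spark.sql("SHOW TABLES").show()
    ```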




[GitHub] spark pull request #13592: [SPARK-15863][SQL][DOC] Initial SQL programming g...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13592#discussion_r66663884
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -12,130 +12,121 @@ title: Spark SQL and DataFrames
     Spark SQL is a Spark module for structured data processing. Unlike the basic Spark RDD API, the interfaces provided
     by Spark SQL provide Spark with more information about the structure of both the data and the computation being performed. Internally,
     Spark SQL uses this extra information to perform extra optimizations. There are several ways to
    -interact with Spark SQL including SQL, the DataFrames API and the Datasets API. When computing a result
    +interact with Spark SQL including SQL and the Datasets API. When computing a result
     the same execution engine is used, independent of which API/language you are using to express the
    -computation. This unification means that developers can easily switch back and forth between the
    -various APIs based on which provides the most natural way to express a given transformation.
    +computation. This unification means that developers can easily switch back and forth between
    +different APIs based on which provides the most natural way to express a given transformation.
     
     All of the examples on this page use sample data included in the Spark distribution and can be run in
     the `spark-shell`, `pyspark` shell, or `sparkR` shell.
     
     ## SQL
     
    -One use of Spark SQL is to execute SQL queries written using either a basic SQL syntax or HiveQL.
    +One use of Spark SQL is to execute SQL queries.
     Spark SQL can also be used to read data from an existing Hive installation. For more on how to
     configure this feature, please refer to the [Hive Tables](#hive-tables) section. When running
    -SQL from within another programming language the results will be returned as a [DataFrame](#DataFrames).
    +SQL from within another programming language the results will be returned as a [Dataset\[Row\]](#datasets).
     You can also interact with the SQL interface using the [command-line](#running-the-spark-sql-cli)
     or over [JDBC/ODBC](#running-the-thrift-jdbcodbc-server).
     
    -## DataFrames
    +## Datasets and DataFrames
     
    -A DataFrame is a distributed collection of data organized into named columns. It is conceptually
    -equivalent to a table in a relational database or a data frame in R/Python, but with richer
    -optimizations under the hood. DataFrames can be constructed from a wide array of [sources](#data-sources) such
    -as: structured data files, tables in Hive, external databases, or existing RDDs.
    +A Dataset is a new interface added in Spark 1.6 that tries to provide the benefits of RDDs (strong
    +typing, ability to use powerful lambda functions) with the benefits of Spark SQL's optimized
    +execution engine. A Dataset can be [constructed](#creating-datasets) from JVM objects and then
    +manipulated using functional transformations (map, flatMap, filter, etc.).
     
    -The DataFrame API is available in [Scala](api/scala/index.html#org.apache.spark.sql.DataFrame),
    -[Java](api/java/index.html?org/apache/spark/sql/DataFrame.html),
    -[Python](api/python/pyspark.sql.html#pyspark.sql.DataFrame), and [R](api/R/index.html).
    +The Dataset API is the successor of the DataFrame API, which was introduced in Spark 1.3. In Spark
    +2.0, Datasets and DataFrames are unified, and DataFrames are now equivalent to Datasets of `Row`s.
    +In fact, `DataFrame` is simply a type alias of `Dataset[Row]` in [the Scala API][scala-datasets].
    +However, [Java API][java-datasets] users must use `Dataset<Row>` instead.
     
    -## Datasets
    +[scala-datasets]: api/scala/index.html#org.apache.spark.sql.Dataset
    +[java-datasets]: api/java/index.html?org/apache/spark/sql/Dataset.html
     
    -A Dataset is a new experimental interface added in Spark 1.6 that tries to provide the benefits of
    -RDDs (strong typing, ability to use powerful lambda functions) with the benefits of Spark SQL's
    -optimized execution engine. A Dataset can be [constructed](#creating-datasets) from JVM objects and then manipulated
    -using functional transformations (map, flatMap, filter, etc.).
    +Python does not yet have support for the Dataset API, but due to its dynamic nature many of the
    +benefits are already available (i.e. you can access the field of a row by name naturally
    +`row.columnName`). The case for R is similar.
     
    -The unified Dataset API can be used both in [Scala](api/scala/index.html#org.apache.spark.sql.Dataset) and
    -[Java](api/java/index.html?org/apache/spark/sql/Dataset.html). Python does not yet have support for
    -the Dataset API, but due to its dynamic nature many of the benefits are already available (i.e. you can
    -access the field of a row by name naturally `row.columnName`). Full python support will be added
    -in a future release.
    +Throughout this document, we will often refer to Scala/Java Datasets of `Row`s as DataFrames.
     
     # Getting Started
     
    -## Starting Point: SQLContext
    +## Starting Point: SparkSession
     
     <div class="codetabs">
     <div data-lang="scala"  markdown="1">
     
     The entry point into all functionality in Spark SQL is the
    -[`SQLContext`](api/scala/index.html#org.apache.spark.sql.SQLContext) class, or one of its
    -descendants. To create a basic `SQLContext`, all you need is a SparkContext.
    +[`SparkSession`](api/scala/index.html#org.apache.spark.sql.SparkSession) class. To create a basic `SparkSession`, all you need is a `SparkContext`.
    --- End diff --
    
    Thanks. Also updated all sample code that mentions `SparkContext` accordingly.
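
    For reference, a rough sketch of what the updated samples could look like once
    no `SparkContext` is constructed by hand (assuming the 2.0 `SparkSession.builder()` API):

    ```scala
    import org.apache.spark.sql.SparkSession

    // The session creates (or reuses) a SparkContext internally.
    val spark = SparkSession.builder()
      .appName("SQL programming guide examples")
      .getOrCreate()

    // The underlying SparkContext is still reachable when it is really needed.
    val sc = spark.sparkContext
    ```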




[GitHub] spark issue #13592: [SPARK-15863][SQL][DOC] Initial SQL programming guide up...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13592
  
    **[Test build #60309 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60309/consoleFull)** for PR 13592 at commit [`819e109`](https://github.com/apache/spark/commit/819e109c5aa8a33fd2e589d09f3dbbbed13162e0).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #13592: [SPARK-15863][SQL][DOC] Initial SQL programming guide up...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13592
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/60698/
    Test PASSed.




[GitHub] spark issue #13592: [SPARK-15863][SQL][DOC] Initial SQL programming guide up...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13592
  
    Merged build finished. Test PASSed.




[GitHub] spark issue #13592: [SPARK-15863][SQL][DOC] Initial SQL programming guide up...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13592
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/60309/
    Test PASSed.




[GitHub] spark pull request #13592: [SPARK-15863][SQL][DOC] Initial SQL programming g...

Posted by clockfly <gi...@git.apache.org>.
Github user clockfly commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13592#discussion_r66892238
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -12,130 +12,130 @@ title: Spark SQL and DataFrames
     Spark SQL is a Spark module for structured data processing. Unlike the basic Spark RDD API, the interfaces provided
     by Spark SQL provide Spark with more information about the structure of both the data and the computation being performed. Internally,
     Spark SQL uses this extra information to perform extra optimizations. There are several ways to
    -interact with Spark SQL including SQL, the DataFrames API and the Datasets API. When computing a result
    +interact with Spark SQL including SQL and the Datasets API. When computing a result
     the same execution engine is used, independent of which API/language you are using to express the
    -computation. This unification means that developers can easily switch back and forth between the
    -various APIs based on which provides the most natural way to express a given transformation.
    +computation. This unification means that developers can easily switch back and forth between
    +different APIs based on which provides the most natural way to express a given transformation.
     
     All of the examples on this page use sample data included in the Spark distribution and can be run in
     the `spark-shell`, `pyspark` shell, or `sparkR` shell.
     
     ## SQL
     
    -One use of Spark SQL is to execute SQL queries written using either a basic SQL syntax or HiveQL.
    +One use of Spark SQL is to execute SQL queries.
     Spark SQL can also be used to read data from an existing Hive installation. For more on how to
     configure this feature, please refer to the [Hive Tables](#hive-tables) section. When running
    -SQL from within another programming language the results will be returned as a [DataFrame](#DataFrames).
    +SQL from within another programming language the results will be returned as a [Dataset\[Row\]](#datasets).
     You can also interact with the SQL interface using the [command-line](#running-the-spark-sql-cli)
     or over [JDBC/ODBC](#running-the-thrift-jdbcodbc-server).
     
    -## DataFrames
    +## Datasets and DataFrames
     
    -A DataFrame is a distributed collection of data organized into named columns. It is conceptually
    -equivalent to a table in a relational database or a data frame in R/Python, but with richer
    -optimizations under the hood. DataFrames can be constructed from a wide array of [sources](#data-sources) such
    -as: structured data files, tables in Hive, external databases, or existing RDDs.
    +A Dataset is a new interface added in Spark 1.6 that tries to provide the benefits of RDDs (strong
    +typing, ability to use powerful lambda functions) with the benefits of Spark SQL's optimized
    +execution engine. A Dataset can be [constructed](#creating-datasets) from JVM objects and then
    --- End diff --
    
    Maybe `[constructed](#creating-datasets) from JVM objects` => `[created from JVM objects](#creating-datasets)`
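
    As a concrete illustration of that sentence, a small Scala sketch (the
    `Person` values here are purely illustrative):

    ```scala
    import org.apache.spark.sql.SparkSession

    case class Person(name: String, age: Long)

    val spark = SparkSession.builder().appName("Dataset example").getOrCreate()
    import spark.implicits._

    // Construct a Dataset from JVM objects, then apply functional transformations.
    val people = Seq(Person("Andy", 32), Person("Justin", 19)).toDS()
    people.filter(_.age > 20).map(_.name).show()
    ```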




[GitHub] spark pull request #13592: [SPARK-15863][SQL][DOC] Initial SQL programming g...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13592#discussion_r67204480
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -1326,7 +1325,7 @@ write.df(df1, "data/test_table/key=1", "parquet", "overwrite")
     write.df(df2, "data/test_table/key=2", "parquet", "overwrite")
     
     # Read the partitioned table
    -df3 <- read.df(sqlContext, "data/test_table", "parquet", mergeSchema="true")
    +df3 <- read.df(spark, "data/test_table", "parquet", mergeSchema="true")
    --- End diff --
    
    ```
    df3 <- read.df("data/test_table", "parquet", mergeSchema="true")
    ```
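
    For comparison, the Scala equivalent of that schema-merging read (a sketch
    assuming the 2.0 `DataFrameReader` API and the example's `data/test_table` path):

    ```scala
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().getOrCreate()

    // Read the partitioned table, merging the schemas of the individual partitions.
    val df3 = spark.read
      .option("mergeSchema", "true")
      .parquet("data/test_table")
    ```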




[GitHub] spark pull request #13592: [SPARK-15863][SQL][DOC] Initial SQL programming g...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13592#discussion_r66886012
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -587,7 +590,7 @@ for the JavaBean.
     
     {% highlight java %}
     // sc is an existing JavaSparkContext.
    -SQLContext sqlContext = new org.apache.spark.sql.SQLContext(sc);
    +SparkSession spark = new org.apache.spark.sql.SparkSession(sc);
     
     // Load a text file and convert each line to a JavaBean.
     JavaRDD<Person> people = sc.textFile("examples/src/main/resources/people.txt").map(
    --- End diff --
    
    It's still valid, but we should probably update it without using `SparkContext`.
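
    A sketch of how that sample might read once `SparkContext` is out of the
    picture, written in Scala for brevity (the `Person` case class mirrors the
    JavaBean in the example; this is an illustration, not the wording that ends
    up in the guide):

    ```scala
    import org.apache.spark.sql.SparkSession

    case class Person(name: String, age: Long)

    val spark = SparkSession.builder().appName("People example").getOrCreate()
    import spark.implicits._

    // Load the text file through the session and map each line to a Person.
    val people = spark.read.textFile("examples/src/main/resources/people.txt")
      .map { line =>
        val fields = line.split(",")
        Person(fields(0), fields(1).trim.toLong)
      }
    people.show()
    ```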




[GitHub] spark issue #13592: [SPARK-15863][SQL][DOC] Initial SQL programming guide up...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13592
  
    **[Test build #60277 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60277/consoleFull)** for PR 13592 at commit [`92f3f11`](https://github.com/apache/spark/commit/92f3f11563934f5d3dd6233663ba3b77fe8bbc67).




[GitHub] spark issue #13592: [SPARK-15863][SQL][DOC] Initial SQL programming guide up...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on the issue:

    https://github.com/apache/spark/pull/13592
  
    @felixcheung I merged this one since I think it is better to make changes in parallel, using this version as the foundation. Can you help with revising the R-related doc? Thanks!




[GitHub] spark pull request #13592: [SPARK-15863][SQL][DOC] Initial SQL programming g...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13592#discussion_r66566175
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -184,20 +175,20 @@ showDF(df)
     </div>
     
     
    -## DataFrame Operations
    +## Untyped Dataset Operations
     
    -DataFrames provide a domain-specific language for structured data manipulation in [Scala](api/scala/index.html#org.apache.spark.sql.DataFrame), [Java](api/java/index.html?org/apache/spark/sql/DataFrame.html), [Python](api/python/pyspark.sql.html#pyspark.sql.DataFrame) and [R](api/R/DataFrame.html).
    +Datasets provide an untyped domain-specific language for structured data manipulation in [Scala](api/scala/index.html#org.apache.spark.sql.Dataset), [Java](api/java/index.html?org/apache/spark/sql/Dataset.html), [Python](api/python/pyspark.sql.html#pyspark.sql.DataFrame) and [R](api/R/DataFrame.html).
    --- End diff --
    
    this is pretty confusing because it makes it sound like Python and R users should also program against Datasets.





[GitHub] spark pull request #13592: [SPARK-15863][SQL][DOC] Initial SQL programming g...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13592#discussion_r67202934
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -939,7 +937,7 @@ df.select("name", "age").write.save("namesAndAges.parquet", format="parquet")
     
     {% highlight r %}
     
    -df <- read.df(sqlContext, "examples/src/main/resources/people.json", "json")
    +df <- read.df(spark, "examples/src/main/resources/people.json", "json")
    --- End diff --
    
    ```
    df <- read.df("examples/src/main/resources/people.json", "json")
    ```
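
    For comparison, the same manually-specified-format load and save in Scala (a
    sketch assuming the 2.0 `DataFrameReader`/`DataFrameWriter` API):

    ```scala
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().getOrCreate()

    // Load people.json with an explicit format, then save a projection as Parquet.
    val df = spark.read.format("json").load("examples/src/main/resources/people.json")
    df.select("name", "age").write.format("parquet").save("namesAndAges.parquet")
    ```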




[GitHub] spark issue #13592: [SPARK-15863][SQL][DOC] Initial SQL programming guide up...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13592
  
    **[Test build #60637 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60637/consoleFull)** for PR 13592 at commit [`200a68c`](https://github.com/apache/spark/commit/200a68c9fb376ada02205e71b218f0905103b0cb).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request #13592: [SPARK-15863][SQL][DOC] Initial SQL programming g...

Posted by clockfly <gi...@git.apache.org>.
Github user clockfly commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13592#discussion_r66893943
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -12,130 +12,130 @@ title: Spark SQL and DataFrames
     Spark SQL is a Spark module for structured data processing. Unlike the basic Spark RDD API, the interfaces provided
     by Spark SQL provide Spark with more information about the structure of both the data and the computation being performed. Internally,
     Spark SQL uses this extra information to perform extra optimizations. There are several ways to
    -interact with Spark SQL including SQL, the DataFrames API and the Datasets API. When computing a result
    +interact with Spark SQL including SQL and the Datasets API. When computing a result
     the same execution engine is used, independent of which API/language you are using to express the
    -computation. This unification means that developers can easily switch back and forth between the
    -various APIs based on which provides the most natural way to express a given transformation.
    +computation. This unification means that developers can easily switch back and forth between
    +different APIs based on which provides the most natural way to express a given transformation.
     
     All of the examples on this page use sample data included in the Spark distribution and can be run in
     the `spark-shell`, `pyspark` shell, or `sparkR` shell.
     
     ## SQL
     
    -One use of Spark SQL is to execute SQL queries written using either a basic SQL syntax or HiveQL.
    +One use of Spark SQL is to execute SQL queries.
     Spark SQL can also be used to read data from an existing Hive installation. For more on how to
     configure this feature, please refer to the [Hive Tables](#hive-tables) section. When running
    -SQL from within another programming language the results will be returned as a [DataFrame](#DataFrames).
    +SQL from within another programming language the results will be returned as a [Dataset\[Row\]](#datasets).
     You can also interact with the SQL interface using the [command-line](#running-the-spark-sql-cli)
     or over [JDBC/ODBC](#running-the-thrift-jdbcodbc-server).
     
    -## DataFrames
    +## Datasets and DataFrames
     
    -A DataFrame is a distributed collection of data organized into named columns. It is conceptually
    -equivalent to a table in a relational database or a data frame in R/Python, but with richer
    -optimizations under the hood. DataFrames can be constructed from a wide array of [sources](#data-sources) such
    -as: structured data files, tables in Hive, external databases, or existing RDDs.
    +A Dataset is a new interface added in Spark 1.6 that tries to provide the benefits of RDDs (strong
    +typing, ability to use powerful lambda functions) with the benefits of Spark SQL's optimized
    +execution engine. A Dataset can be [constructed](#creating-datasets) from JVM objects and then
    +manipulated using functional transformations (map, flatMap, filter, etc.).
     
    -The DataFrame API is available in [Scala](api/scala/index.html#org.apache.spark.sql.DataFrame),
    -[Java](api/java/index.html?org/apache/spark/sql/DataFrame.html),
    -[Python](api/python/pyspark.sql.html#pyspark.sql.DataFrame), and [R](api/R/index.html).
    +The Dataset API is the successor of the DataFrame API, which was introduced in Spark 1.3. In Spark
    +2.0, Datasets and DataFrames are unified, and DataFrames are now equivalent to Datasets of `Row`s.
    +In fact, `DataFrame` is simply a type alias of `Dataset[Row]` in [the Scala API][scala-datasets].
    +However, [Java API][java-datasets] users must use `Dataset<Row>` instead.
     
    -## Datasets
    +[scala-datasets]: api/scala/index.html#org.apache.spark.sql.Dataset
    +[java-datasets]: api/java/index.html?org/apache/spark/sql/Dataset.html
     
    -A Dataset is a new experimental interface added in Spark 1.6 that tries to provide the benefits of
    -RDDs (strong typing, ability to use powerful lambda functions) with the benefits of Spark SQL's
    -optimized execution engine. A Dataset can be [constructed](#creating-datasets) from JVM objects and then manipulated
    -using functional transformations (map, flatMap, filter, etc.).
    +Python does not have support for the Dataset API, but due to its dynamic nature many of the
    +benefits are already available (i.e. you can access the field of a row by name naturally
    +`row.columnName`). The case for R is similar.
     
    -The unified Dataset API can be used both in [Scala](api/scala/index.html#org.apache.spark.sql.Dataset) and
    -[Java](api/java/index.html?org/apache/spark/sql/Dataset.html). Python does not yet have support for
    -the Dataset API, but due to its dynamic nature many of the benefits are already available (i.e. you can
    -access the field of a row by name naturally `row.columnName`). Full python support will be added
    -in a future release.
    +Throughout this document, we will often refer to Scala/Java Datasets of `Row`s as DataFrames.
     
     # Getting Started
     
    -## Starting Point: SQLContext
    +## Starting Point: SparkSession
     
     <div class="codetabs">
     <div data-lang="scala"  markdown="1">
     
    -The entry point into all functionality in Spark SQL is the
    -[`SQLContext`](api/scala/index.html#org.apache.spark.sql.SQLContext) class, or one of its
    -descendants. To create a basic `SQLContext`, all you need is a SparkContext.
    +The entry point into all functionality in Spark SQL is the [`SparkSession`](api/scala/index.html#org.apache.spark.sql.SparkSession) class. To create a basic `SparkSession`, just use `SparkSession.builder()`:
     
     {% highlight scala %}
    -val sc: SparkContext // An existing SparkContext.
    -val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    +import org.apache.spark.sql.SparkSession
    +
    +val spark = SparkSession.build()
    +  .master("local")
    +  .appName("Word Count")
    +  .config("spark.some.config.option", "some-value")
    +  .getOrCreate()
     
     // this is used to implicitly convert an RDD to a DataFrame.
    -import sqlContext.implicits._
    +import spark.implicits._
     {% endhighlight %}
     
     </div>
     
     <div data-lang="java" markdown="1">
     
    -The entry point into all functionality in Spark SQL is the
    -[`SQLContext`](api/java/index.html#org.apache.spark.sql.SQLContext) class, or one of its
    -descendants. To create a basic `SQLContext`, all you need is a SparkContext.
    +The entry point into all functionality in Spark SQL is the [`SparkSession`](api/java/index.html#org.apache.spark.sql.SparkSession) class. To create a basic `SparkSession`, just use `SparkSession.build()`:
     
     {% highlight java %}
    -JavaSparkContext sc = ...; // An existing JavaSparkContext.
    -SQLContext sqlContext = new org.apache.spark.sql.SQLContext(sc);
    -{% endhighlight %}
    +import org.apache.spark.sql.SparkSession
     
    +SparkSession spark = SparkSession.build()
    +  .master("local")
    +  .appName("Word Count")
    +  .config("spark.some.config.option", "some-value")
    +  .getOrCreate();
    +{% endhighlight %}
     </div>
     
     <div data-lang="python"  markdown="1">
     
    -The entry point into all relational functionality in Spark is the
    -[`SQLContext`](api/python/pyspark.sql.html#pyspark.sql.SQLContext) class, or one
    -of its decedents. To create a basic `SQLContext`, all you need is a SparkContext.
    +The entry point into all relational functionality in Spark is the [`SparkSession`](api/python/pyspark.sql.html#pyspark.sql.SparkSession) class. To create a basic `SparkSession`, just use `SparkSession.build`:
     
     {% highlight python %}
    -from pyspark.sql import SQLContext
    -sqlContext = SQLContext(sc)
    +from pyspark.sql import SparkSession
    +spark = SparkSession.build \
    +  .master("local") \
    +  .appName("Word Count") \
    +  .config("spark.some.config.option", "some-value") \
    +  .getOrCreate()
     {% endhighlight %}
     
     </div>
     
     <div data-lang="r"  markdown="1">
     
     The entry point into all relational functionality in Spark is the
    -`SQLContext` class, or one of its decedents. To create a basic `SQLContext`, all you need is a SparkContext.
    +`SparkSession` class. To create a basic `SparkSession`, all you need is a `SparkContext`.
     
     {% highlight r %}
    -sqlContext <- sparkRSQL.init(sc)
    +spark <- sparkRSQL.init(sc)
     {% endhighlight %}
     
     </div>
     </div>
     
    -In addition to the basic `SQLContext`, you can also create a `HiveContext`, which provides a
    -superset of the functionality provided by the basic `SQLContext`. Additional features include
    -the ability to write queries using the more complete HiveQL parser, access to Hive UDFs, and the
    -ability to read data from Hive tables. To use a `HiveContext`, you do not need to have an
    -existing Hive setup, and all of the data sources available to a `SQLContext` are still available.
    -`HiveContext` is only packaged separately to avoid including all of Hive's dependencies in the default
    -Spark build. If these dependencies are not a problem for your application then using `HiveContext`
    -is recommended for the 1.3 release of Spark. Future releases will focus on bringing `SQLContext` up
    -to feature parity with a `HiveContext`.
    +`SparkSession` in Spark 2.0 provides builtin support for Hive features including the ability to
    +write queries using HiveQL, access to Hive UDFs, and the ability to read data from Hive tables.
    +To use these features, you do not need to have an existing Hive setup, and all of the data sources
    +available to a `SparkSession` are still available. These Hive features are only packaged separately
    +to avoid including all of Hive's dependencies in the default Spark build.
     
     
     ## Creating DataFrames
     
    -With a `SQLContext`, applications can create `DataFrame`s from an <a href='#interoperating-with-rdds'>existing `RDD`</a>, from a Hive table, or from <a href='#data-sources'>data sources</a>.
    +With a `SparkSession`, applications can create DataFrames from an <a href='#interoperating-with-rdds'>existing `RDD`</a>,
    --- End diff --
    
    DataFrames => DataFrame?




[GitHub] spark pull request #13592: [SPARK-15863][SQL][DOC] Initial SQL programming g...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13592#discussion_r66874954
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -12,130 +12,130 @@ title: Spark SQL and DataFrames
     Spark SQL is a Spark module for structured data processing. Unlike the basic Spark RDD API, the interfaces provided
     by Spark SQL provide Spark with more information about the structure of both the data and the computation being performed. Internally,
     Spark SQL uses this extra information to perform extra optimizations. There are several ways to
    -interact with Spark SQL including SQL, the DataFrames API and the Datasets API. When computing a result
    +interact with Spark SQL including SQL and the Datasets API. When computing a result
     the same execution engine is used, independent of which API/language you are using to express the
    -computation. This unification means that developers can easily switch back and forth between the
    -various APIs based on which provides the most natural way to express a given transformation.
    +computation. This unification means that developers can easily switch back and forth between
    +different APIs based on which provides the most natural way to express a given transformation.
     
     All of the examples on this page use sample data included in the Spark distribution and can be run in
     the `spark-shell`, `pyspark` shell, or `sparkR` shell.
     
     ## SQL
     
    -One use of Spark SQL is to execute SQL queries written using either a basic SQL syntax or HiveQL.
    +One use of Spark SQL is to execute SQL queries.
     Spark SQL can also be used to read data from an existing Hive installation. For more on how to
     configure this feature, please refer to the [Hive Tables](#hive-tables) section. When running
    -SQL from within another programming language the results will be returned as a [DataFrame](#DataFrames).
    +SQL from within another programming language the results will be returned as a [Dataset\[Row\]](#datasets).
     You can also interact with the SQL interface using the [command-line](#running-the-spark-sql-cli)
     or over [JDBC/ODBC](#running-the-thrift-jdbcodbc-server).
     
    -## DataFrames
    +## Datasets and DataFrames
     
    -A DataFrame is a distributed collection of data organized into named columns. It is conceptually
    -equivalent to a table in a relational database or a data frame in R/Python, but with richer
    -optimizations under the hood. DataFrames can be constructed from a wide array of [sources](#data-sources) such
    -as: structured data files, tables in Hive, external databases, or existing RDDs.
    +A Dataset is a new interface added in Spark 1.6 that tries to provide the benefits of RDDs (strong
    +typing, ability to use powerful lambda functions) with the benefits of Spark SQL's optimized
    +execution engine. A Dataset can be [constructed](#creating-datasets) from JVM objects and then
    +manipulated using functional transformations (map, flatMap, filter, etc.).
     
    -The DataFrame API is available in [Scala](api/scala/index.html#org.apache.spark.sql.DataFrame),
    -[Java](api/java/index.html?org/apache/spark/sql/DataFrame.html),
    -[Python](api/python/pyspark.sql.html#pyspark.sql.DataFrame), and [R](api/R/index.html).
    +The Dataset API is the successor of the DataFrame API, which was introduced in Spark 1.3. In Spark
    +2.0, Datasets and DataFrames are unified, and DataFrames are now equivalent to Datasets of `Row`s.
    +In fact, `DataFrame` is simply a type alias of `Dataset[Row]` in [the Scala API][scala-datasets].
    +However, [Java API][java-datasets] users must use `Dataset<Row>` instead.
     
    -## Datasets
    +[scala-datasets]: api/scala/index.html#org.apache.spark.sql.Dataset
    +[java-datasets]: api/java/index.html?org/apache/spark/sql/Dataset.html
     
    -A Dataset is a new experimental interface added in Spark 1.6 that tries to provide the benefits of
    -RDDs (strong typing, ability to use powerful lambda functions) with the benefits of Spark SQL's
    -optimized execution engine. A Dataset can be [constructed](#creating-datasets) from JVM objects and then manipulated
    -using functional transformations (map, flatMap, filter, etc.).
    +Python does not have support for the Dataset API, but due to its dynamic nature many of the
    +benefits are already available (i.e. you can access the field of a row by name naturally
    +`row.columnName`). The case for R is similar.
     
    -The unified Dataset API can be used both in [Scala](api/scala/index.html#org.apache.spark.sql.Dataset) and
    -[Java](api/java/index.html?org/apache/spark/sql/Dataset.html). Python does not yet have support for
    -the Dataset API, but due to its dynamic nature many of the benefits are already available (i.e. you can
    -access the field of a row by name naturally `row.columnName`). Full python support will be added
    -in a future release.
    +Throughout this document, we will often refer to Scala/Java Datasets of `Row`s as DataFrames.
     
     # Getting Started
     
    -## Starting Point: SQLContext
    +## Starting Point: SparkSession
     
     <div class="codetabs">
     <div data-lang="scala"  markdown="1">
     
    -The entry point into all functionality in Spark SQL is the
    -[`SQLContext`](api/scala/index.html#org.apache.spark.sql.SQLContext) class, or one of its
    -descendants. To create a basic `SQLContext`, all you need is a SparkContext.
    +The entry point into all functionality in Spark SQL is the [`SparkSession`](api/scala/index.html#org.apache.spark.sql.SparkSession) class. To create a basic `SparkSession`, just use `SparkSession.builder()`:
     
     {% highlight scala %}
    -val sc: SparkContext // An existing SparkContext.
    -val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    +import org.apache.spark.sql.SparkSession
    +
    +val spark = SparkSession.build()
    +  .master("local")
    +  .appName("Word Count")
    +  .config("spark.some.config.option", "some-value")
    +  .getOrCreate()
     
     // this is used to implicitly convert an RDD to a DataFrame.
    -import sqlContext.implicits._
    +import spark.implicits._
     {% endhighlight %}
     
     </div>
     
     <div data-lang="java" markdown="1">
     
    -The entry point into all functionality in Spark SQL is the
    -[`SQLContext`](api/java/index.html#org.apache.spark.sql.SQLContext) class, or one of its
    -descendants. To create a basic `SQLContext`, all you need is a SparkContext.
    +The entry point into all functionality in Spark SQL is the [`SparkSession`](api/java/index.html#org.apache.spark.sql.SparkSession) class. To create a basic `SparkSession`, just use `SparkSession.build()`:
     
     {% highlight java %}
    -JavaSparkContext sc = ...; // An existing JavaSparkContext.
    -SQLContext sqlContext = new org.apache.spark.sql.SQLContext(sc);
    -{% endhighlight %}
    +import org.apache.spark.sql.SparkSession
     
    +SparkSession spark = SparkSession.build()
    +  .master("local")
    +  .appName("Word Count")
    +  .config("spark.some.config.option", "some-value")
    +  .getOrCreate();
    +{% endhighlight %}
     </div>
     
     <div data-lang="python"  markdown="1">
     
    -The entry point into all relational functionality in Spark is the
    -[`SQLContext`](api/python/pyspark.sql.html#pyspark.sql.SQLContext) class, or one
    -of its decedents. To create a basic `SQLContext`, all you need is a SparkContext.
    +The entry point into all relational functionality in Spark is the [`SparkSession`](api/python/pyspark.sql.html#pyspark.sql.SparkSession) class. To create a basic `SparkSession`, just use `SparkSession.build`:
     
     {% highlight python %}
    -from pyspark.sql import SQLContext
    -sqlContext = SQLContext(sc)
    +from pyspark.sql import SparkSession
    +spark = SparkSession.build \
    +  .master("local") \
    +  .appName("Word Count") \
    +  .config("spark.some.config.option", "some-value") \
    +  .getOrCreate()
     {% endhighlight %}
     
     </div>
     
     <div data-lang="r"  markdown="1">
     
     The entry point into all relational functionality in Spark is the
    -`SQLContext` class, or one of its decedents. To create a basic `SQLContext`, all you need is a SparkContext.
    +`SparkSession` class. To create a basic `SparkSession`, all you need is a `SparkContext`.
    --- End diff --
    
    SparkR still uses a `SparkContext`.




[GitHub] spark issue #13592: [SPARK-15863][SQL][DOC] Initial SQL programming guide up...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13592
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/60277/
    Test PASSed.




[GitHub] spark pull request #13592: [SPARK-15863][SQL][DOC] Initial SQL programming g...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13592#discussion_r67202830
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -889,7 +887,7 @@ df.select("name", "favorite_color").write.save("namesAndFavColors.parquet")
     <div data-lang="r"  markdown="1">
     
     {% highlight r %}
    -df <- read.df(sqlContext, "examples/src/main/resources/users.parquet")
    +df <- read.df(spark, "examples/src/main/resources/users.parquet")
    --- End diff --
    
    ```
    df <- read.df("examples/src/main/resources/users.parquet")
    ```
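
    And the Scala counterpart of that default-format load (a sketch assuming the
    2.0 API, where Parquet is the default data source):

    ```scala
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().getOrCreate()

    // Parquet is the default data source, so no format needs to be specified.
    val df = spark.read.load("examples/src/main/resources/users.parquet")
    df.select("name", "favorite_color").write.save("namesAndFavColors.parquet")
    ```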




[GitHub] spark pull request #13592: [SPARK-15863][SQL][DOC] Initial SQL programming g...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13592#discussion_r67749843
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -1740,17 +1759,14 @@ results = spark.sql("FROM src SELECT key, value").collect()
     
     <div data-lang="r"  markdown="1">
     
    -When working with Hive one must construct a `HiveContext`, which inherits from `SQLContext`, and
    +When working with Hive one must construct a `HiveContext`, which inherits from `SparkSession`, and
    --- End diff --
    
    Ditto here; use `sparkR.session(enableHiveSupport = TRUE)` instead.
    `HiveContext` has been deprecated in SparkR.




[GitHub] spark issue #13592: [SPARK-15863][SQL][DOC] Initial SQL programming guide up...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on the issue:

    https://github.com/apache/spark/pull/13592
  
    Thanks! Let's get it in first and then we can revise it.




[GitHub] spark pull request #13592: [SPARK-15863][SQL][DOC] Initial SQL programming g...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13592#discussion_r66566019
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -1,7 +1,7 @@
     ---
     layout: global
    -displayTitle: Spark SQL, DataFrames and Datasets Guide
    -title: Spark SQL and DataFrames
    +displayTitle: Spark SQL and Datasets Guide
    +title: Spark SQL and Datasets
    --- End diff --
    
    I'd keep DataFrame in the title since Python is using it.





[GitHub] spark pull request #13592: [SPARK-15863][SQL][DOC] Initial SQL programming g...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13592#discussion_r67774982
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -12,130 +12,129 @@ title: Spark SQL and DataFrames
     Spark SQL is a Spark module for structured data processing. Unlike the basic Spark RDD API, the interfaces provided
     by Spark SQL provide Spark with more information about the structure of both the data and the computation being performed. Internally,
     Spark SQL uses this extra information to perform extra optimizations. There are several ways to
    -interact with Spark SQL including SQL, the DataFrames API and the Datasets API. When computing a result
    +interact with Spark SQL including SQL and the Dataset API. When computing a result
     the same execution engine is used, independent of which API/language you are using to express the
    -computation. This unification means that developers can easily switch back and forth between the
    -various APIs based on which provides the most natural way to express a given transformation.
    +computation. This unification means that developers can easily switch back and forth between
    +different APIs based on which provides the most natural way to express a given transformation.
     
     All of the examples on this page use sample data included in the Spark distribution and can be run in
     the `spark-shell`, `pyspark` shell, or `sparkR` shell.
     
     ## SQL
     
    -One use of Spark SQL is to execute SQL queries written using either a basic SQL syntax or HiveQL.
    +One use of Spark SQL is to execute SQL queries.
     Spark SQL can also be used to read data from an existing Hive installation. For more on how to
     configure this feature, please refer to the [Hive Tables](#hive-tables) section. When running
    -SQL from within another programming language the results will be returned as a [DataFrame](#DataFrames).
    +SQL from within another programming language the results will be returned as a [DataFrame](#datasets-and-dataframes).
     You can also interact with the SQL interface using the [command-line](#running-the-spark-sql-cli)
     or over [JDBC/ODBC](#running-the-thrift-jdbcodbc-server).
     
    -## DataFrames
    +## Datasets and DataFrames
     
    -A DataFrame is a distributed collection of data organized into named columns. It is conceptually
    -equivalent to a table in a relational database or a data frame in R/Python, but with richer
    -optimizations under the hood. DataFrames can be constructed from a wide array of [sources](#data-sources) such
    -as: structured data files, tables in Hive, external databases, or existing RDDs.
    +A Dataset is a new interface added in Spark 1.6 that tries to provide the benefits of RDDs (strong
    +typing, ability to use powerful lambda functions) with the benefits of Spark SQL's optimized
    +execution engine. A Dataset can be [constructed](#creating-datasets) from JVM objects and then
    +manipulated using functional transformations (`map`, `flatMap`, `filter`, etc.).
     
    -The DataFrame API is available in [Scala](api/scala/index.html#org.apache.spark.sql.DataFrame),
    -[Java](api/java/index.html?org/apache/spark/sql/DataFrame.html),
    -[Python](api/python/pyspark.sql.html#pyspark.sql.DataFrame), and [R](api/R/index.html).
    +The Dataset API is the successor of the DataFrame API, which was introduced in Spark 1.3. In Spark
    +2.0, Datasets and DataFrames are unified, and DataFrames are now equivalent to Datasets of `Row`s.
    +In fact, `DataFrame` is simply a type alias of `Dataset[Row]` in [the Scala API][scala-datasets].
    +However, [Java API][java-datasets] users must use `Dataset<Row>` instead.
     
    -## Datasets
    +[scala-datasets]: api/scala/index.html#org.apache.spark.sql.Dataset
    +[java-datasets]: api/java/index.html?org/apache/spark/sql/Dataset.html
     
    -A Dataset is a new experimental interface added in Spark 1.6 that tries to provide the benefits of
    -RDDs (strong typing, ability to use powerful lambda functions) with the benefits of Spark SQL's
    -optimized execution engine. A Dataset can be [constructed](#creating-datasets) from JVM objects and then manipulated
    -using functional transformations (map, flatMap, filter, etc.).
    +Python does not have support for the Dataset API, but due to its dynamic nature many of the
    +benefits are already available (i.e. you can access the field of a row by name naturally
    +`row.columnName`). The case for R is similar.
     
    -The unified Dataset API can be used both in [Scala](api/scala/index.html#org.apache.spark.sql.Dataset) and
    -[Java](api/java/index.html?org/apache/spark/sql/Dataset.html). Python does not yet have support for
    -the Dataset API, but due to its dynamic nature many of the benefits are already available (i.e. you can
    -access the field of a row by name naturally `row.columnName`). Full python support will be added
    -in a future release.
    +Throughout this document, we will often refer to Scala/Java Datasets of `Row`s as DataFrames.
     
     # Getting Started
     
    -## Starting Point: SQLContext
    +## Starting Point: SparkSession
     
     <div class="codetabs">
     <div data-lang="scala"  markdown="1">
     
    -The entry point into all functionality in Spark SQL is the
    -[`SQLContext`](api/scala/index.html#org.apache.spark.sql.SQLContext) class, or one of its
    -descendants. To create a basic `SQLContext`, all you need is a SparkContext.
    +The entry point into all functionality in Spark is the [`SparkSession`](api/scala/index.html#org.apache.spark.sql.SparkSession) class. To create a basic `SparkSession`, just use `SparkSession.build()`:
     
     {% highlight scala %}
    -val sc: SparkContext // An existing SparkContext.
    -val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    +import org.apache.spark.sql.SparkSession
    +
    +val spark = SparkSession.build()
    +  .master("local")
    +  .appName("Word Count")
    +  .config("spark.some.config.option", "some-value")
    +  .getOrCreate()
     
     // this is used to implicitly convert an RDD to a DataFrame.
    -import sqlContext.implicits._
    +import spark.implicits._
     {% endhighlight %}
     
     </div>
     
     <div data-lang="java" markdown="1">
     
    -The entry point into all functionality in Spark SQL is the
    -[`SQLContext`](api/java/index.html#org.apache.spark.sql.SQLContext) class, or one of its
    -descendants. To create a basic `SQLContext`, all you need is a SparkContext.
    +The entry point into all functionality in Spark is the [`SparkSession`](api/java/index.html#org.apache.spark.sql.SparkSession) class. To create a basic `SparkSession`, just use `SparkSession.build()`:
     
     {% highlight java %}
    -JavaSparkContext sc = ...; // An existing JavaSparkContext.
    -SQLContext sqlContext = new org.apache.spark.sql.SQLContext(sc);
    -{% endhighlight %}
    +import org.apache.spark.sql.SparkSession
     
    +SparkSession spark = SparkSession.build()
    +  .master("local")
    +  .appName("Word Count")
    +  .config("spark.some.config.option", "some-value")
    +  .getOrCreate();
    +{% endhighlight %}
     </div>
     
     <div data-lang="python"  markdown="1">
     
    -The entry point into all relational functionality in Spark is the
    -[`SQLContext`](api/python/pyspark.sql.html#pyspark.sql.SQLContext) class, or one
    -of its decedents. To create a basic `SQLContext`, all you need is a SparkContext.
    +The entry point into all functionality in Spark is the [`SparkSession`](api/python/pyspark.sql.html#pyspark.sql.SparkSession) class. To create a basic `SparkSession`, just use `SparkSession.build`:
     
     {% highlight python %}
    -from pyspark.sql import SQLContext
    -sqlContext = SQLContext(sc)
    +from pyspark.sql import SparkSession
    +
    +spark = SparkSession.build \
    +  .master("local") \
    +  .appName("Word Count") \
    +  .config("spark.some.config.option", "some-value") \
    +  .getOrCreate()
     {% endhighlight %}
     
     </div>
     
     <div data-lang="r"  markdown="1">
     
    -The entry point into all relational functionality in Spark is the
    -`SQLContext` class, or one of its decedents. To create a basic `SQLContext`, all you need is a SparkContext.
    +Unlike Scala, Java, and Python API, we haven't finished migrating `SQLContext` to `SparkSession` for SparkR yet, so
    +the entry point into all relational functionality in SparkR is still the
    +`SQLContext` class in Spark 2.0. To create a basic `SQLContext`, all you need is a `SparkContext`.
    --- End diff --
    
    ah, just saw it. Great!




[GitHub] spark pull request #13592: [SPARK-15863][SQL][DOC] Initial SQL programming g...

Posted by WeichenXu123 <gi...@git.apache.org>.
Github user WeichenXu123 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13592#discussion_r66700281
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -517,24 +517,26 @@ types such as Sequences or Arrays. This RDD can be implicitly converted to a Dat
     registered as a table. Tables can be used in subsequent SQL statements.
     
     {% highlight scala %}
    -// sc is an existing SparkContext.
    -val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    +val spark: SparkSession // An existing SparkSession
     // this is used to implicitly convert an RDD to a DataFrame.
    -import sqlContext.implicits._
    +import spark.implicits._
     
     // Define the schema using a case class.
     // Note: Case classes in Scala 2.10 can support only up to 22 fields. To work around this limit,
     // you can use custom classes that implement the Product interface.
     case class Person(name: String, age: Int)
     
    -// Create an RDD of Person objects and register it as a table.
    -val people = sc.textFile("examples/src/main/resources/people.txt").map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt)).toDF()
    +// Create an RDD of Person objects and register it as a temporary view.
    +val people = sc
    +  .textFile("examples/src/main/resources/people.txt")
    +  .map(_.split(","))
    +  .map(p => Person(p(0), p(1).trim.toInt))
    +  .toDF()
     people.createOrReplaceTempView("people")
    --- End diff --
    
    Here it seems better to update the input data file to JSON format, and then use `SparkSession.read.json('path/to/data.json')` so that we don't need the SparkContext and can directly get a `DataFrame`, which simplifies the example code.
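
    A minimal sketch of what that could look like (assuming the example switches to the `people.json` sample file under `examples/src/main/resources/`; the session setup and query here are just for illustration):

    ```scala
    import org.apache.spark.sql.SparkSession

    // Assumes the builder-style entry point for obtaining a session.
    val spark = SparkSession.builder()
      .appName("JSON example")
      .getOrCreate()

    // read.json infers the schema from the JSON records and returns a DataFrame
    // directly, so no SparkContext, manual splitting, or case class is needed.
    val people = spark.read.json("examples/src/main/resources/people.json")

    // Register the DataFrame as a temporary view so it can be queried with SQL.
    people.createOrReplaceTempView("people")

    val teenagers = spark.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
    teenagers.show()
    ```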




[GitHub] spark issue #13592: [SPARK-15863][SQL][DOC] Initial SQL programming guide up...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13592
  
    **[Test build #60664 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60664/consoleFull)** for PR 13592 at commit [`4b3c4d3`](https://github.com/apache/spark/commit/4b3c4d31644745d3d74f265644b2d9d406cfe82b).




[GitHub] spark pull request #13592: [SPARK-15863][SQL][DOC] Initial SQL programming g...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13592#discussion_r67747099
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -12,130 +12,129 @@ title: Spark SQL and DataFrames
     <div data-lang="r"  markdown="1">
     
    -The entry point into all relational functionality in Spark is the
    -`SQLContext` class, or one of its decedents. To create a basic `SQLContext`, all you need is a SparkContext.
    +Unlike Scala, Java, and Python API, we haven't finished migrating `SQLContext` to `SparkSession` for SparkR yet, so
    +the entry point into all relational functionality in SparkR is still the
    +`SQLContext` class in Spark 2.0. To create a basic `SQLContext`, all you need is a `SparkContext`.
    --- End diff --
    
    And use of `SQLContext` is deprecated. Please see PR #13751.




[GitHub] spark pull request #13592: [SPARK-15863][SQL][DOC] Initial SQL programming g...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13592#discussion_r66875208
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -145,10 +145,10 @@ df.show()
     
     <div data-lang="java" markdown="1">
     {% highlight java %}
    -JavaSparkContext sc = ...; // An existing JavaSparkContext.
    -SQLContext sqlContext = new org.apache.spark.sql.SQLContext(sc);
    +SparkSession spark = ...; // An existing SparkSession.
    +SparkSession spark = new org.apache.spark.sql.SparkSession(sc);
    --- End diff --
    
    hm? `spark` is declared twice here, and constructing it with `new org.apache.spark.sql.SparkSession(sc)` doesn't look right.




[GitHub] spark pull request #13592: [SPARK-15863][SQL][DOC] Initial SQL programming g...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13592#discussion_r67204146
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -1142,11 +1141,11 @@ write.parquet(schemaPeople, "people.parquet")
     
     # Read in the Parquet file created above. Parquet files are self-describing so the schema is preserved.
     # The result of loading a parquet file is also a DataFrame.
    -parquetFile <- read.parquet(sqlContext, "people.parquet")
    +parquetFile <- read.parquet(spark, "people.parquet")
     
     # Parquet files can also be used to create a temporary view and then used in SQL statements.
     registerTempTable(parquetFile, "parquetFile")
    -teenagers <- sql(sqlContext, "SELECT name FROM parquetFile WHERE age >= 13 AND age <= 19")
    +teenagers <- sql(spark, "SELECT name FROM parquetFile WHERE age >= 13 AND age <= 19")
    --- End diff --
    
    Since passing the context explicitly is deprecated for SparkR in 2.0, this can simply be:

    ```
    teenagers <- sql("SELECT name FROM parquetFile WHERE age >= 13 AND age <= 19")
    ```

