Posted to reviews@spark.apache.org by xuanyuanking <gi...@git.apache.org> on 2018/10/16 12:24:07 UTC

[GitHub] spark pull request #22746: [SPARK-24499][Doc] Split the page of sql-programm...

GitHub user xuanyuanking opened a pull request:

    https://github.com/apache/spark/pull/22746

    [SPARK-24499][Doc] Split the page of sql-programming-guide

    ## What changes were proposed in this pull request?
    
    (Please fill in changes proposed in this fix)
    
    ## How was this patch tested?
    
    (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
    (If this patch involves UI changes, please attach a screenshot; otherwise, remove this)
    
    Please review http://spark.apache.org/contributing.html before opening a pull request.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/xuanyuanking/spark SPARK-24499

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22746.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #22746
    
----
commit c2ad4a3420db08a4c8dbe5c3bbfb9938e3c73fff
Author: Yuanjian Li <xy...@...>
Date:   2018-10-16T12:16:50Z

    Split the page of sql-programming-guide

----


---



[GitHub] spark issue #22746: [SPARK-24499][SQL][DOC] Split the page of sql-programmin...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22746
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/97519/
    Test PASSed.


---



[GitHub] spark pull request #22746: [SPARK-24499][SQL][DOC] Split the page of sql-pro...

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22746#discussion_r226227872
  
    --- Diff: docs/sql-data-sources.md ---
    @@ -0,0 +1,42 @@
    +---
    +layout: global
    +title: Data Sources
    +displayTitle: Data Sources
    +---
    +
    +
    +Spark SQL supports operating on a variety of data sources through the DataFrame interface.
    +A DataFrame can be operated on using relational transformations and can also be used to create a temporary view.
    +Registering a DataFrame as a temporary view allows you to run SQL queries over its data. This section
    +describes the general methods for loading and saving data using the Spark Data Sources and then
    +goes into specific options that are available for the built-in data sources.
    +
    +
    +* [Generic Load/Save Functions](sql-data-sources-load-save-functions.html)
    +  * [Manually Sepcifying Options](sql-data-sources-load-save-functions.html#manually-sepcifying-options)
    --- End diff --
    
    `sepcifying` -> `specifying`. In other places, too.
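    For readers following the intro of this new page: the workflow it describes (load a DataFrame from a data source, register it as a temporary view, then run SQL over it) looks roughly like the Scala sketch below. The path and view name are only illustrative, not part of the diff.

    ```scala
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("data-sources-sketch").getOrCreate()

    // Load with the default source (parquet, unless spark.sql.sources.default says otherwise).
    val usersDF = spark.read.load("examples/src/main/resources/users.parquet")

    // Registering the DataFrame as a temporary view lets us query it with SQL.
    usersDF.createOrReplaceTempView("users")
    spark.sql("SELECT name FROM users").show()
    ```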


---



[GitHub] spark issue #22746: [SPARK-24499][SQL][DOC] Split the page of sql-programmin...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22746
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #22746: [SPARK-24499][SQL][DOC] Split the page of sql-programmin...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22746
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #22746: [SPARK-24499][SQL][DOC] Split the page of sql-programmin...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22746
  
    **[Test build #97453 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97453/testReport)** for PR 22746 at commit [`c2ad4a3`](https://github.com/apache/spark/commit/c2ad4a3420db08a4c8dbe5c3bbfb9938e3c73fff).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #22746: [SPARK-24499][SQL][DOC] Split the page of sql-programmin...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22746
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #22746: [SPARK-24499][SQL][DOC] Split the page of sql-programmin...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22746
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/97482/
    Test PASSed.


---



[GitHub] spark issue #22746: [SPARK-24499][SQL][DOC] Split the page of sql-programmin...

Posted by gengliangwang <gi...@git.apache.org>.
Github user gengliangwang commented on the issue:

    https://github.com/apache/spark/pull/22746
  
    This is cool +1 👍 


---



[GitHub] spark issue #22746: [SPARK-24499][SQL][DOC] Split the page of sql-programmin...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22746
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/97535/
    Test PASSed.


---



[GitHub] spark pull request #22746: [SPARK-24499][SQL][DOC] Split the page of sql-pro...

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22746#discussion_r226239048
  
    --- Diff: docs/sql-data-sources-parquet.md ---
    @@ -0,0 +1,321 @@
    +---
    +layout: global
    +title: Parquet Files
    +displayTitle: Parquet Files
    +---
    +
    +* Table of contents
    +{:toc}
    +
    +[Parquet](http://parquet.io) is a columnar format that is supported by many other data processing systems.
    +Spark SQL provides support for both reading and writing Parquet files that automatically preserves the schema
    +of the original data. When writing Parquet files, all columns are automatically converted to be nullable for
    +compatibility reasons.
    +
    +### Loading Data Programmatically
    +
    +Using the data from the above example:
    +
    +<div class="codetabs">
    +
    +<div data-lang="scala"  markdown="1">
    +{% include_example basic_parquet_example scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %}
    +</div>
    +
    +<div data-lang="java"  markdown="1">
    +{% include_example basic_parquet_example java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %}
    +</div>
    +
    +<div data-lang="python"  markdown="1">
    +
    +{% include_example basic_parquet_example python/sql/datasource.py %}
    +</div>
    +
    +<div data-lang="r"  markdown="1">
    +
    +{% include_example basic_parquet_example r/RSparkSQLExample.R %}
    +
    +</div>
    +
    +<div data-lang="sql"  markdown="1">
    +
    +{% highlight sql %}
    +
    +CREATE TEMPORARY VIEW parquetTable
    +USING org.apache.spark.sql.parquet
    +OPTIONS (
    +  path "examples/src/main/resources/people.parquet"
    +)
    +
    +SELECT * FROM parquetTable
    +
    +{% endhighlight %}
    +
    +</div>
    +
    +</div>
    +
    +### Partition Discovery
    +
    +Table partitioning is a common optimization approach used in systems like Hive. In a partitioned
    +table, data are usually stored in different directories, with partitioning column values encoded in
    +the path of each partition directory. All built-in file sources (including Text/CSV/JSON/ORC/Parquet)
    +are able to discover and infer partitioning information automatically.
    +For example, we can store all our previously used
    +population data into a partitioned table using the following directory structure, with two extra
    +columns, `gender` and `country` as partitioning columns:
    +
    +{% highlight text %}
    +
    +path
    +└── to
    +    └── table
    +        ├── gender=male
    +        │   ├── ...
    +        │   │
    +        │   ├── country=US
    +        │   │   └── data.parquet
    +        │   ├── country=CN
    +        │   │   └── data.parquet
    +        │   └── ...
    +        └── gender=female
    +            ├── ...
    +            │
    +            ├── country=US
    +            │   └── data.parquet
    +            ├── country=CN
    +            │   └── data.parquet
    +            └── ...
    +
    +{% endhighlight %}
    +
    +By passing `path/to/table` to either `SparkSession.read.parquet` or `SparkSession.read.load`, Spark SQL
    +will automatically extract the partitioning information from the paths.
    +Now the schema of the returned DataFrame becomes:
    +
    +{% highlight text %}
    +
    +root
    +|-- name: string (nullable = true)
    +|-- age: long (nullable = true)
    +|-- gender: string (nullable = true)
    +|-- country: string (nullable = true)
    +
    +{% endhighlight %}
    +
    +Notice that the data types of the partitioning columns are automatically inferred. Currently,
    +numeric data types, date, timestamp and string type are supported. Sometimes users may not want
    +to automatically infer the data types of the partitioning columns. For these use cases, the
    +automatic type inference can be configured by
    +`spark.sql.sources.partitionColumnTypeInference.enabled`, which defaults to `true`. When type
    +inference is disabled, string type will be used for the partitioning columns.
    +
    +Starting from Spark 1.6.0, partition discovery only finds partitions under the given paths
    +by default. For the above example, if users pass `path/to/table/gender=male` to either
    +`SparkSession.read.parquet` or `SparkSession.read.load`, `gender` will not be considered as a
    +partitioning column. If users need to specify the base path that partition discovery
    +should start with, they can set `basePath` in the data source options. For example,
    +when `path/to/table/gender=male` is the path of the data and
    +users set `basePath` to `path/to/table/`, `gender` will be a partitioning column.
    +
    +### Schema Merging
    +
    +Like ProtocolBuffer, Avro, and Thrift, Parquet also supports schema evolution. Users can start with
    +a simple schema, and gradually add more columns to the schema as needed. In this way, users may end
    +up with multiple Parquet files with different but mutually compatible schemas. The Parquet data
    +source is now able to automatically detect this case and merge schemas of all these files.
    +
    +Since schema merging is a relatively expensive operation, and is not a necessity in most cases, we
    +turned it off by default starting from 1.5.0. You may enable it by
    +
    +1. setting data source option `mergeSchema` to `true` when reading Parquet files (as shown in the
    +   examples below), or
    +2. setting the global SQL option `spark.sql.parquet.mergeSchema` to `true`.
    +
    +<div class="codetabs">
    +
    +<div data-lang="scala"  markdown="1">
    +{% include_example schema_merging scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %}
    +</div>
    +
    +<div data-lang="java"  markdown="1">
    +{% include_example schema_merging java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %}
    +</div>
    +
    +<div data-lang="python"  markdown="1">
    +
    +{% include_example schema_merging python/sql/datasource.py %}
    +</div>
    +
    +<div data-lang="r"  markdown="1">
    +
    +{% include_example schema_merging r/RSparkSQLExample.R %}
    +
    +</div>
    +
    +</div>
    +
    +### Hive metastore Parquet table conversion
    +
    +When reading from and writing to Hive metastore Parquet tables, Spark SQL will try to use its own
    +Parquet support instead of Hive SerDe for better performance. This behavior is controlled by the
    +`spark.sql.hive.convertMetastoreParquet` configuration, and is turned on by default.
    +
    +#### Hive/Parquet Schema Reconciliation
    +
    +There are two key differences between Hive and Parquet from the perspective of table schema
    +processing.
    +
    +1. Hive is case insensitive, while Parquet is not
    +1. Hive considers all columns nullable, while nullability in Parquet is significant
    +
    +For this reason, we must reconcile the Hive metastore schema with the Parquet schema when converting a
    +Hive metastore Parquet table to a Spark SQL Parquet table. The reconciliation rules are:
    +
    +1. Fields that have the same name in both schemas must have the same data type regardless of
    +   nullability. The reconciled field should have the data type of the Parquet side, so that
    +   nullability is respected.
    +
    +1. The reconciled schema contains exactly those fields defined in the Hive metastore schema.
    +
    +   - Any fields that only appear in the Parquet schema are dropped in the reconciled schema.
    +   - Any fields that only appear in the Hive metastore schema are added as nullable fields in the
    +     reconciled schema.
    +
    +#### Metadata Refreshing
    +
    +Spark SQL caches Parquet metadata for better performance. When Hive metastore Parquet table
    +conversion is enabled, metadata of those converted tables are also cached. If these tables are
    +updated by Hive or other external tools, you need to refresh them manually to ensure consistent
    +metadata.
    +
    +<div class="codetabs">
    +
    +<div data-lang="scala"  markdown="1">
    +
    +{% highlight scala %}
    +// spark is an existing SparkSession
    +spark.catalog.refreshTable("my_table")
    +{% endhighlight %}
    +
    +</div>
    +
    +<div data-lang="java"  markdown="1">
    +
    +{% highlight java %}
    +// spark is an existing SparkSession
    +spark.catalog().refreshTable("my_table");
    +{% endhighlight %}
    +
    +</div>
    +
    +<div data-lang="python"  markdown="1">
    +
    +{% highlight python %}
    +# spark is an existing SparkSession
    +spark.catalog.refreshTable("my_table")
    +{% endhighlight %}
    +
    +</div>
    +
    +<div data-lang="r"  markdown="1">
    +
    +{% highlight r %}
    +refreshTable("my_table")
    +{% endhighlight %}
    +
    +</div>
    +
    +<div data-lang="sql"  markdown="1">
    +
    +{% highlight sql %}
    +REFRESH TABLE my_table;
    +{% endhighlight %}
    +
    +</div>
    +
    +</div>
    +
    +### Configuration
    +
    +Configuration of Parquet can be done using the `setConf` method on `SparkSession` or by running
    +`SET key=value` commands using SQL.
    +
    +<table class="table">
    +<tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr>
    +<tr>
    +  <td><code>spark.sql.parquet.binaryAsString</code></td>
    +  <td>false</td>
    +  <td>
    +    Some other Parquet-producing systems, in particular Impala, Hive, and older versions of Spark SQL, do
    --- End diff --
    
    nit: `in particular Impala, ...` -> `in particular, Impala, ...`?
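    To make the partition-discovery text above concrete, here is a rough Scala sketch using the hypothetical `path/to/table` layout from the diff; it only illustrates the described behaviour and is not code from the PR.

    ```scala
    // spark is an existing SparkSession
    // Reading the table root: gender and country are discovered as partition columns.
    val full = spark.read.parquet("path/to/table")
    full.printSchema()  // name, age, gender, country

    // Reading a partition directory directly: gender is NOT treated as a partition column...
    val males = spark.read.parquet("path/to/table/gender=male")

    // ...unless basePath tells partition discovery where the table starts.
    val malesWithGender = spark.read
      .option("basePath", "path/to/table/")
      .parquet("path/to/table/gender=male")

    // Optionally disable partition-column type inference (partition columns become strings).
    spark.conf.set("spark.sql.sources.partitionColumnTypeInference.enabled", "false")
    ```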

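    Likewise, a minimal sketch of the two ways to enable schema merging listed above, matching the Configuration section's `SET key=value` style; the paths are placeholders.

    ```scala
    // spark is an existing SparkSession
    // 1. Per read, via the mergeSchema data source option.
    val merged = spark.read
      .option("mergeSchema", "true")
      .parquet("data/evolving_table")

    // 2. Globally, via the SQL option (same effect as `SET spark.sql.parquet.mergeSchema=true`).
    spark.conf.set("spark.sql.parquet.mergeSchema", "true")
    val mergedToo = spark.read.parquet("data/evolving_table")
    ```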

---



[GitHub] spark issue #22746: [SPARK-24499][SQL][DOC] Split the page of sql-programmin...

Posted by xuanyuanking <gi...@git.apache.org>.
Github user xuanyuanking commented on the issue:

    https://github.com/apache/spark/pull/22746
  
    My pleasure, thanks for reviewing this!


---



[GitHub] spark issue #22746: [SPARK-24499][SQL][DOC] Split the page of sql-programmin...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22746
  
    **[Test build #97519 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97519/testReport)** for PR 22746 at commit [`27b066d`](https://github.com/apache/spark/commit/27b066d7635bf2d7a04c869468b3ea9273f75ef6).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #22746: [SPARK-24499][SQL][DOC] Split the page of sql-programmin...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22746
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/4064/
    Test PASSed.


---



[GitHub] spark pull request #22746: [SPARK-24499][SQL][DOC] Split the page of sql-pro...

Posted by xuanyuanking <gi...@git.apache.org>.
Github user xuanyuanking commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22746#discussion_r226011057
  
    --- Diff: docs/_data/menu-sql.yaml ---
    @@ -0,0 +1,79 @@
    +- text: Getting Started
    +  url: sql-getting-started.html
    +  subitems:
    +    - text: "Starting Point: SparkSession"
    +      url: sql-getting-started.html#starting-point-sparksession
    +    - text: Creating DataFrames
    +      url: sql-getting-started.html#creating-dataframes
    +    - text: Untyped Dataset Operations
    --- End diff --
    
    Makes sense; kept it the same as `sql-getting-started.html#untyped-dataset-operations-aka-dataframe-operations`. Done in b3fc39d.


---



[GitHub] spark issue #22746: [SPARK-24499][SQL][DOC] Split the page of sql-programmin...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/22746
  
    LGTM


---



[GitHub] spark issue #22746: [SPARK-24499][SQL][DOC] Split the page of sql-programmin...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22746
  
    **[Test build #97535 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97535/testReport)** for PR 22746 at commit [`17995f9`](https://github.com/apache/spark/commit/17995f92bf4f7b6831f129b558669346a5eafedf).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark pull request #22746: [SPARK-24499][SQL][DOC] Split the page of sql-pro...

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22746#discussion_r226235672
  
    --- Diff: docs/sql-data-sources-load-save-functions.md ---
    @@ -0,0 +1,283 @@
    +---
    +layout: global
    +title: Generic Load/Save Functions
    +displayTitle: Generic Load/Save Functions
    +---
    +
    +* Table of contents
    +{:toc}
    +
    +
    +In the simplest form, the default data source (`parquet` unless otherwise configured by
    +`spark.sql.sources.default`) will be used for all operations.
    +
    +
    +<div class="codetabs">
    +<div data-lang="scala"  markdown="1">
    +{% include_example generic_load_save_functions scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %}
    +</div>
    +
    +<div data-lang="java"  markdown="1">
    +{% include_example generic_load_save_functions java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %}
    +</div>
    +
    +<div data-lang="python"  markdown="1">
    +
    +{% include_example generic_load_save_functions python/sql/datasource.py %}
    +</div>
    +
    +<div data-lang="r"  markdown="1">
    +
    +{% include_example generic_load_save_functions r/RSparkSQLExample.R %}
    +
    +</div>
    +</div>
    +
    +### Manually Specifying Options
    +
    +You can also manually specify the data source that will be used along with any extra options
    +that you would like to pass to the data source. Data sources are specified by their fully qualified
    +name (i.e., `org.apache.spark.sql.parquet`), but for built-in sources you can also use their short
    +names (`json`, `parquet`, `jdbc`, `orc`, `libsvm`, `csv`, `text`). DataFrames loaded from any data
    +source type can be converted into other types using this syntax.
    +
    +To load a JSON file you can use:
    +
    +<div class="codetabs">
    +<div data-lang="scala"  markdown="1">
    +{% include_example manual_load_options scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %}
    +</div>
    +
    +<div data-lang="java"  markdown="1">
    +{% include_example manual_load_options java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %}
    +</div>
    +
    +<div data-lang="python"  markdown="1">
    +{% include_example manual_load_options python/sql/datasource.py %}
    +</div>
    +
    +<div data-lang="r"  markdown="1">
    +{% include_example manual_load_options r/RSparkSQLExample.R %}
    +</div>
    +</div>
    +
    +To load a CSV file you can use:
    +
    +<div class="codetabs">
    +<div data-lang="scala"  markdown="1">
    +{% include_example manual_load_options_csv scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %}
    +</div>
    +
    +<div data-lang="java"  markdown="1">
    +{% include_example manual_load_options_csv java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %}
    +</div>
    +
    +<div data-lang="python"  markdown="1">
    +{% include_example manual_load_options_csv python/sql/datasource.py %}
    +</div>
    +
    +<div data-lang="r"  markdown="1">
    +{% include_example manual_load_options_csv r/RSparkSQLExample.R %}
    +
    +</div>
    +</div>
    +
    +### Run SQL on files directly
    +
    +Instead of using the read API to load a file into a DataFrame and query it, you can also query that
    +file directly with SQL.
    +
    +<div class="codetabs">
    +<div data-lang="scala"  markdown="1">
    +{% include_example direct_sql scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %}
    +</div>
    +
    +<div data-lang="java"  markdown="1">
    +{% include_example direct_sql java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %}
    +</div>
    +
    +<div data-lang="python"  markdown="1">
    +{% include_example direct_sql python/sql/datasource.py %}
    +</div>
    +
    +<div data-lang="r"  markdown="1">
    +{% include_example direct_sql r/RSparkSQLExample.R %}
    +
    +</div>
    +</div>
    +
    +### Save Modes
    +
    +Save operations can optionally take a `SaveMode`, which specifies how to handle existing data if
    +present. It is important to realize that these save modes do not utilize any locking and are not
    +atomic. Additionally, when performing an `Overwrite`, the data will be deleted before writing out the
    +new data.
    +
    +<table class="table">
    +<tr><th>Scala/Java</th><th>Any Language</th><th>Meaning</th></tr>
    +<tr>
    +  <td><code>SaveMode.ErrorIfExists</code> (default)</td>
    +  <td><code>"error" or "errorifexists"</code> (default)</td>
    +  <td>
    +    When saving a DataFrame to a data source, if data already exists,
    +    an exception is expected to be thrown.
    +  </td>
    +</tr>
    +<tr>
    +  <td><code>SaveMode.Append</code></td>
    +  <td><code>"append"</code></td>
    +  <td>
    +    When saving a DataFrame to a data source, if data/table already exists,
    +    contents of the DataFrame are expected to be appended to existing data.
    +  </td>
    +</tr>
    +<tr>
    +  <td><code>SaveMode.Overwrite</code></td>
    +  <td><code>"overwrite"</code></td>
    +  <td>
    +    Overwrite mode means that when saving a DataFrame to a data source,
    +    if data/table already exists, existing data is expected to be overwritten by the contents of
    +    the DataFrame.
    +  </td>
    +</tr>
    +<tr>
    +  <td><code>SaveMode.Ignore</code></td>
    +  <td><code>"ignore"</code></td>
    +  <td>
    +    Ignore mode means that when saving a DataFrame to a data source, if data already exists,
    +    the save operation is expected to not save the contents of the DataFrame and to not
    --- End diff --
    
    nit: `expected to not ... to not ...` -> `expected not to ... not to ...`?

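    And for the Save Modes table, a short sketch of the two equivalent spellings, assuming `df` is an existing DataFrame and with a made-up output path.

    ```scala
    import org.apache.spark.sql.SaveMode

    // Scala/Java enum form...
    df.write.mode(SaveMode.Overwrite).parquet("output/people")

    // ...or the language-neutral string form.
    df.write.mode("append").format("parquet").save("output/people")
    ```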

---



[GitHub] spark issue #22746: [SPARK-24499][SQL][DOC] Split the page of sql-programmin...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22746
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/97453/
    Test PASSed.


---



[GitHub] spark issue #22746: [SPARK-24499][SQL][DOC] Split the page of sql-programmin...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22746
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #22746: [SPARK-24499][SQL][DOC] Split the page of sql-programmin...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22746
  
    Merged build finished. Test PASSed.


---



[GitHub] spark pull request #22746: [SPARK-24499][SQL][DOC] Split the page of sql-pro...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22746#discussion_r225797461
  
    --- Diff: docs/_data/menu-sql.yaml ---
    @@ -0,0 +1,79 @@
    +- text: Getting Started
    +  url: sql-getting-started.html
    +  subitems:
    +    - text: "Starting Point: SparkSession"
    +      url: sql-getting-started.html#starting-point-sparksession
    +    - text: Creating DataFrames
    +      url: sql-getting-started.html#creating-dataframes
    +    - text: Untyped Dataset Operations
    --- End diff --
    
    how about `Untyped Dataset Operations (DataFrame operations)`


---



[GitHub] spark issue #22746: [SPARK-24499][SQL][DOC] Split the page of sql-programmin...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22746
  
    **[Test build #97500 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97500/testReport)** for PR 22746 at commit [`b3fc39d`](https://github.com/apache/spark/commit/b3fc39d005e985b4ec769e10a4221c5b4d0591b4).


---



[GitHub] spark pull request #22746: [SPARK-24499][SQL][DOC] Split the page of sql-pro...

Posted by xuanyuanking <gi...@git.apache.org>.
Github user xuanyuanking commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22746#discussion_r226202227
  
    --- Diff: docs/_data/menu-sql.yaml ---
    @@ -0,0 +1,81 @@
    +- text: Getting Started
    +  url: sql-getting-started.html
    +  subitems:
    +    - text: "Starting Point: SparkSession"
    +      url: sql-getting-started.html#starting-point-sparksession
    +    - text: Creating DataFrames
    +      url: sql-getting-started.html#creating-dataframes
    +    - text: Untyped Dataset Operations (DataFrame operations)
    +      url: sql-getting-started.html#untyped-dataset-operations-aka-dataframe-operations
    +    - text: Running SQL Queries Programmatically
    +      url: sql-getting-started.html#running-sql-queries-programmatically
    +    - text: Global Temporary View
    +      url: sql-getting-started.html#global-temporary-view
    +    - text: Creating Datasets
    +      url: sql-getting-started.html#creating-datasets
    +    - text: Interoperating with RDDs
    +      url: sql-getting-started.html#interoperating-with-rdds
    +    - text: Aggregations
    +      url: sql-getting-started.html#aggregations
    +- text: Data Sources
    +  url: sql-data-sources.html
    +  subitems:
    +    - text: "Generic Load/Save Functions"
    +      url: sql-data-sources-load-save-functions.html
    +    - text: Parquet Files
    +      url: sql-data-sources-parquet.html
    +    - text: ORC Files
    +      url: sql-data-sources-other.html#orc-files
    +    - text: JSON Datasets
    +      url: sql-data-sources-other.html#json-datasets
    +    - text: Hive Tables
    +      url: sql-data-sources-hive-tables.html
    +    - text: JDBC To Other Databases
    +      url: sql-data-sources-jdbc.html
    +    - text: Avro Files
    +      url: sql-data-sources-avro.html
    +    - text: Troubleshooting
    +      url: sql-data-sources-other.html#troubleshooting
    --- End diff --
    
    Makes sense; I will split it into `sql-data-sources-orc`, `sql-data-sources-json`, and `sql-data-sources-troubleshooting` (they still need the `sql-data-sources` prefix because [here](https://github.com/apache/spark/pull/22746/files#diff-5075091c2498292f7afcac68bfd63e1eR13) we need "sql-data-sources" as the nav-left tag; otherwise the nav menu will not show the subitems).


---



[GitHub] spark pull request #22746: [SPARK-24499][SQL][DOC] Split the page of sql-pro...

Posted by gengliangwang <gi...@git.apache.org>.
Github user gengliangwang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22746#discussion_r225783658
  
    --- Diff: docs/sql-reference.md ---
    @@ -0,0 +1,641 @@
    +---
    +layout: global
    +title: Reference
    +displayTitle: Reference
    +---
    +
    +* Table of contents
    +{:toc}
    +
    +## Data Types
    +
    +Spark SQL and DataFrames support the following data types:
    +
    +* Numeric types
    +    - `ByteType`: Represents 1-byte signed integer numbers.
    --- End diff --
    
    nit: use 2 space indent.
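    For reference, a small sketch of how the numeric types listed on this new page are typically used to declare an explicit schema; the field names and path are made up.

    ```scala
    import org.apache.spark.sql.types._

    val schema = StructType(Seq(
      StructField("tiny",  ByteType,    nullable = true),   // 1-byte signed integer
      StructField("small", ShortType,   nullable = true),   // 2-byte signed integer
      StructField("id",    IntegerType, nullable = false)   // 4-byte signed integer
    ))

    // spark is an existing SparkSession
    val df = spark.read.schema(schema).json("data/numbers.json")
    ```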


---



[GitHub] spark issue #22746: [SPARK-24499][SQL][DOC] Split the page of sql-programmin...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22746
  
    Merged build finished. Test PASSed.


---



[GitHub] spark pull request #22746: [SPARK-24499][SQL][DOC] Split the page of sql-pro...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22746#discussion_r226213393
  
    --- Diff: docs/sql-data-sources-other.md ---
    @@ -0,0 +1,114 @@
    +---
    +layout: global
    +title: Other Data Sources
    +displayTitle: Other Data Sources
    +---
    +
    +* Table of contents
    +{:toc}
    +
    +## ORC Files
    +
    +Since Spark 2.3, Spark supports a vectorized ORC reader with a new ORC file format for ORC files.
    +To do that, the following configurations are newly added. The vectorized reader is used for the
    +native ORC tables (e.g., the ones created using the clause `USING ORC`) when `spark.sql.orc.impl`
    +is set to `native` and `spark.sql.orc.enableVectorizedReader` is set to `true`. For the Hive ORC
    +serde tables (e.g., the ones created using the clause `USING HIVE OPTIONS (fileFormat 'ORC')`),
    +the vectorized reader is used when `spark.sql.hive.convertMetastoreOrc` is also set to `true`.
    +
    +<table class="table">
    +  <tr><th><b>Property Name</b></th><th><b>Default</b></th><th><b>Meaning</b></th></tr>
    +  <tr>
    +    <td><code>spark.sql.orc.impl</code></td>
    +    <td><code>native</code></td>
    +    <td>The name of ORC implementation. It can be one of <code>native</code> and <code>hive</code>. <code>native</code> means the native ORC support that is built on Apache ORC 1.4. `hive` means the ORC library in Hive 1.2.1.</td>
    +  </tr>
    +  <tr>
    +    <td><code>spark.sql.orc.enableVectorizedReader</code></td>
    +    <td><code>true</code></td>
    +    <td>Enables vectorized orc decoding in <code>native</code> implementation. If <code>false</code>, a new non-vectorized ORC reader is used in <code>native</code> implementation. For <code>hive</code> implementation, this is ignored.</td>
    +  </tr>
    +</table>
    +
    +## JSON Datasets
    --- End diff --
    
    We support a typical JSON file, don't we?
    > For a regular multi-line JSON file, set the `multiLine` option to `true`.
    
    IMO, that notice means we provide more flexibility.
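    A rough sketch of both points, for illustration only (the paths are made up): the ORC settings come from the table in the diff, and `multiLine` is the option quoted above.

    ```scala
    // spark is an existing SparkSession
    // ORC: the two configurations from the table above (shown with their default values).
    spark.conf.set("spark.sql.orc.impl", "native")
    spark.conf.set("spark.sql.orc.enableVectorizedReader", "true")
    val orcDF = spark.read.orc("data/table_orc")

    // JSON: one object per line is expected by default; for a regular
    // multi-line JSON document, set the multiLine option.
    val jsonDF = spark.read.option("multiLine", "true").json("data/people_multiline.json")
    ```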


---



[GitHub] spark pull request #22746: [SPARK-24499][SQL][DOC] Split the page of sql-pro...

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22746#discussion_r226241683
  
    --- Diff: docs/sql-distributed-sql-engine.md ---
    @@ -0,0 +1,85 @@
    +---
    +layout: global
    +title: Distributed SQL Engine
    +displayTitle: Distributed SQL Engine
    +---
    +
    +* Table of contents
    +{:toc}
    +
    +Spark SQL can also act as a distributed query engine using its JDBC/ODBC or command-line interface.
    +In this mode, end-users or applications can interact with Spark SQL directly to run SQL queries,
    +without the need to write any code.
    +
    +## Running the Thrift JDBC/ODBC server
    +
    +The Thrift JDBC/ODBC server implemented here corresponds to the [`HiveServer2`](https://cwiki.apache.org/confluence/display/Hive/Setting+Up+HiveServer2)
    +in Hive 1.2.1. You can test the JDBC server with the beeline script that comes with either Spark or Hive 1.2.1.
    +
    +To start the JDBC/ODBC server, run the following in the Spark directory:
    +
    +    ./sbin/start-thriftserver.sh
    +
    +This script accepts all `bin/spark-submit` command line options, plus a `--hiveconf` option to
    +specify Hive properties. You may run `./sbin/start-thriftserver.sh --help` for a complete list of
    +all available options. By default, the server listens on localhost:10000. You may override this
    +behaviour via either environment variables, i.e.:
    +
    +{% highlight bash %}
    +export HIVE_SERVER2_THRIFT_PORT=<listening-port>
    +export HIVE_SERVER2_THRIFT_BIND_HOST=<listening-host>
    +./sbin/start-thriftserver.sh \
    +  --master <master-uri> \
    +  ...
    +{% endhighlight %}
    +
    +or system properties:
    +
    +{% highlight bash %}
    +./sbin/start-thriftserver.sh \
    +  --hiveconf hive.server2.thrift.port=<listening-port> \
    +  --hiveconf hive.server2.thrift.bind.host=<listening-host> \
    +  --master <master-uri>
    +  ...
    +{% endhighlight %}
    +
    +Now you can use beeline to test the Thrift JDBC/ODBC server:
    +
    +    ./bin/beeline
    +
    +Connect to the JDBC/ODBC server in beeline with:
    +
    +    beeline> !connect jdbc:hive2://localhost:10000
    +
    +Beeline will ask you for a username and password. In non-secure mode, simply enter the username on
    +your machine and a blank password. For secure mode, please follow the instructions given in the
    +[beeline documentation](https://cwiki.apache.org/confluence/display/Hive/HiveServer2+Clients).
    +
    +Configuration of Hive is done by placing your `hive-site.xml`, `core-site.xml` and `hdfs-site.xml` files in `conf/`.
    +
    +You may also use the beeline script that comes with Hive.
    +
    +Thrift JDBC server also supports sending thrift RPC messages over HTTP transport.
    +Use the following setting to enable HTTP mode as system property or in `hive-site.xml` file in `conf/`:
    +
    +    hive.server2.transport.mode - Set this to value: http
    +    hive.server2.thrift.http.port - HTTP port number to listen on; default is 10001
    +    hive.server2.http.endpoint - HTTP endpoint; default is cliservice
    +
    +To test, use beeline to connect to the JDBC/ODBC server in http mode with:
    +
    +    beeline> !connect jdbc:hive2://<host>:<port>/<database>?hive.server2.transport.mode=http;hive.server2.thrift.http.path=<http_endpoint>
    +
    +
    +## Running the Spark SQL CLI
    +
    +The Spark SQL CLI is a convenient tool to run the Hive metastore service in local mode and execute
    +queries input from the command line. Note that the Spark SQL CLI cannot talk to the Thrift JDBC server.
    +
    +To start the Spark SQL CLI, run the following in the Spark directory:
    +
    +    ./bin/spark-sql
    +
    +Configuration of Hive is done by placing your `hive-site.xml`, `core-site.xml` and `hdfs-site.xml` files in `conf/`.
    +You may run `./bin/spark-sql --help` for a complete list of all available
    +options.
    --- End diff --
    
    super nit: this line can be concatenated with the previous line.
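    Beyond beeline, applications can also connect to the Thrift server over plain JDBC. A minimal Scala sketch, assuming the Hive JDBC driver is on the classpath and the server is listening on the default localhost:10000; in non-secure mode the password can be left blank, matching the beeline instructions above.

    ```scala
    import java.sql.DriverManager

    // Connect to the Thrift JDBC/ODBC server started with start-thriftserver.sh.
    val conn = DriverManager.getConnection(
      "jdbc:hive2://localhost:10000/default", "your-username", "")
    try {
      val rs = conn.createStatement().executeQuery("SHOW TABLES")
      while (rs.next()) println(rs.getString(1))
    } finally {
      conn.close()
    }
    ```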


---



[GitHub] spark issue #22746: [SPARK-24499][SQL][DOC] Split the page of sql-programmin...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22746
  
    Merged build finished. Test PASSed.


---



[GitHub] spark pull request #22746: [SPARK-24499][SQL][DOC] Split the page of sql-pro...

Posted by xuanyuanking <gi...@git.apache.org>.
Github user xuanyuanking commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22746#discussion_r226288610
  
    --- Diff: docs/sql-data-sources-other.md ---
    @@ -0,0 +1,114 @@
    +---
    +layout: global
    +title: Other Data Sources
    +displayTitle: Other Data Sources
    +---
    +
    +* Table of contents
    +{:toc}
    +
    +## ORC Files
    +
    +Since Spark 2.3, Spark supports a vectorized ORC reader with a new ORC file format for ORC files.
    +To do that, the following configurations are newly added. The vectorized reader is used for the
    +native ORC tables (e.g., the ones created using the clause `USING ORC`) when `spark.sql.orc.impl`
    +is set to `native` and `spark.sql.orc.enableVectorizedReader` is set to `true`. For the Hive ORC
    +serde tables (e.g., the ones created using the clause `USING HIVE OPTIONS (fileFormat 'ORC')`),
    +the vectorized reader is used when `spark.sql.hive.convertMetastoreOrc` is also set to `true`.
    +
    +<table class="table">
    +  <tr><th><b>Property Name</b></th><th><b>Default</b></th><th><b>Meaning</b></th></tr>
    +  <tr>
    +    <td><code>spark.sql.orc.impl</code></td>
    +    <td><code>native</code></td>
    +    <td>The name of ORC implementation. It can be one of <code>native</code> and <code>hive</code>. <code>native</code> means the native ORC support that is built on Apache ORC 1.4. `hive` means the ORC library in Hive 1.2.1.</td>
    +  </tr>
    +  <tr>
    +    <td><code>spark.sql.orc.enableVectorizedReader</code></td>
    +    <td><code>true</code></td>
    +    <td>Enables vectorized orc decoding in <code>native</code> implementation. If <code>false</code>, a new non-vectorized ORC reader is used in <code>native</code> implementation. For <code>hive</code> implementation, this is ignored.</td>
    +  </tr>
    +</table>
    +
    +## JSON Datasets
    --- End diff --
    
    Done in 17995f9. Thanks!


---



[GitHub] spark issue #22746: [SPARK-24499][SQL][DOC] Split the page of sql-programmin...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22746
  
    **[Test build #97535 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97535/testReport)** for PR 22746 at commit [`17995f9`](https://github.com/apache/spark/commit/17995f92bf4f7b6831f129b558669346a5eafedf).


---



[GitHub] spark pull request #22746: [SPARK-24499][SQL][DOC] Split the page of sql-pro...

Posted by xuanyuanking <gi...@git.apache.org>.
Github user xuanyuanking commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22746#discussion_r226202492
  
    --- Diff: docs/sql-data-sources-other.md ---
    @@ -0,0 +1,114 @@
    +---
    +layout: global
    +title: Other Data Sources
    +displayTitle: Other Data Sources
    +---
    +
    +* Table of contents
    +{:toc}
    +
    +## ORC Files
    +
    +Since Spark 2.3, Spark supports a vectorized ORC reader with a new ORC file format for ORC files.
    +To do that, the following configurations are newly added. The vectorized reader is used for the
    +native ORC tables (e.g., the ones created using the clause `USING ORC`) when `spark.sql.orc.impl`
    +is set to `native` and `spark.sql.orc.enableVectorizedReader` is set to `true`. For the Hive ORC
    +serde tables (e.g., the ones created using the clause `USING HIVE OPTIONS (fileFormat 'ORC')`),
    +the vectorized reader is used when `spark.sql.hive.convertMetastoreOrc` is also set to `true`.
    +
    +<table class="table">
    +  <tr><th><b>Property Name</b></th><th><b>Default</b></th><th><b>Meaning</b></th></tr>
    +  <tr>
    +    <td><code>spark.sql.orc.impl</code></td>
    +    <td><code>native</code></td>
    +    <td>The name of ORC implementation. It can be one of <code>native</code> and <code>hive</code>. <code>native</code> means the native ORC support that is built on Apache ORC 1.4. `hive` means the ORC library in Hive 1.2.1.</td>
    +  </tr>
    +  <tr>
    +    <td><code>spark.sql.orc.enableVectorizedReader</code></td>
    +    <td><code>true</code></td>
    +    <td>Enables vectorized orc decoding in <code>native</code> implementation. If <code>false</code>, a new non-vectorized ORC reader is used in <code>native</code> implementation. For <code>hive</code> implementation, this is ignored.</td>
    +  </tr>
    +</table>
    +
    +## JSON Datasets
    --- End diff --
    
    Maybe keep `Datasets`? As the description below says, `Note that the file that is offered as a json file is not a typical JSON file`. WDYT?


---



[GitHub] spark pull request #22746: [SPARK-24499][SQL][DOC] Split the page of sql-pro...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22746#discussion_r225780740
  
    --- Diff: docs/sql-getting-started.md ---
    @@ -0,0 +1,369 @@
    +---
    +layout: global
    +title: Getting Started
    +displayTitle: Getting Started
    +---
    +
    +* Table of contents
    +{:toc}
    +
    +## Starting Point: SparkSession
    +
    +<div class="codetabs">
    +<div data-lang="scala"  markdown="1">
    +
    +The entry point into all functionality in Spark is the [`SparkSession`](api/scala/index.html#org.apache.spark.sql.SparkSession) class. To create a basic `SparkSession`, just use `SparkSession.builder()`:
    +
    +{% include_example init_session scala/org/apache/spark/examples/sql/SparkSQLExample.scala %}
    +</div>
    +
    +<div data-lang="java" markdown="1">
    +
    +The entry point into all functionality in Spark is the [`SparkSession`](api/java/index.html#org.apache.spark.sql.SparkSession) class. To create a basic `SparkSession`, just use `SparkSession.builder()`:
    +
    +{% include_example init_session java/org/apache/spark/examples/sql/JavaSparkSQLExample.java %}
    +</div>
    +
    +<div data-lang="python"  markdown="1">
    +
    +The entry point into all functionality in Spark is the [`SparkSession`](api/python/pyspark.sql.html#pyspark.sql.SparkSession) class. To create a basic `SparkSession`, just use `SparkSession.builder`:
    +
    +{% include_example init_session python/sql/basic.py %}
    +</div>
    +
    +<div data-lang="r"  markdown="1">
    +
    +The entry point into all functionality in Spark is the [`SparkSession`](api/R/sparkR.session.html) class. To initialize a basic `SparkSession`, just call `sparkR.session()`:
    +
    +{% include_example init_session r/RSparkSQLExample.R %}
    +
    +Note that when invoked for the first time, `sparkR.session()` initializes a global `SparkSession` singleton instance, and always returns a reference to this instance for successive invocations. In this way, users only need to initialize the `SparkSession` once, then SparkR functions like `read.df` will be able to access this global instance implicitly, and users don't need to pass the `SparkSession` instance around.
    +</div>
    +</div>
    +
    +`SparkSession` in Spark 2.0 provides builtin support for Hive features including the ability to
    +write queries using HiveQL, access to Hive UDFs, and the ability to read data from Hive tables.
    +To use these features, you do not need to have an existing Hive setup.
    +
    +## Creating DataFrames
    +
    +<div class="codetabs">
    +<div data-lang="scala"  markdown="1">
    +With a `SparkSession`, applications can create DataFrames from an [existing `RDD`](#interoperating-with-rdds),
    +from a Hive table, or from [Spark data sources](#data-sources).
    --- End diff --
    
    The link `[Spark data sources](#data-sources)` does not work after this change. Could you fix all the similar cases? Thanks!
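    Since the `include_example` tags above only pull code from the Spark repo, here is roughly what the Scala starting point boils down to; the app name and config option are placeholders.

    ```scala
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("Spark SQL basic example")
      .config("spark.some.config.option", "some-value")
      .getOrCreate()

    // For implicit conversions, e.g. converting Scala collections to DataFrames.
    import spark.implicits._

    val df = spark.read.json("examples/src/main/resources/people.json")
    df.show()
    ```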


---



[GitHub] spark issue #22746: [SPARK-24499][SQL][DOC] Split the page of sql-programmin...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/22746
  
    This is very cool! Thanks!


---



[GitHub] spark issue #22746: [SPARK-24499][SQL][DOC] Split the page of sql-programmin...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22746
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/97500/
    Test PASSed.


---



[GitHub] spark issue #22746: [SPARK-24499][SQL][DOC] Split the page of sql-programmin...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22746
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/4090/
    Test PASSed.


---



[GitHub] spark pull request #22746: [SPARK-24499][SQL][DOC] Split the page of sql-pro...

Posted by xuanyuanking <gi...@git.apache.org>.
Github user xuanyuanking commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22746#discussion_r226226005
  
    --- Diff: docs/sql-data-sources-other.md ---
    @@ -0,0 +1,114 @@
    +---
    +layout: global
    +title: Other Data Sources
    +displayTitle: Other Data Sources
    +---
    +
    +* Table of contents
    +{:toc}
    +
    +## ORC Files
    +
    +Since Spark 2.3, Spark supports a vectorized ORC reader with a new ORC file format for ORC files.
    +To do that, the following configurations are newly added. The vectorized reader is used for the
    +native ORC tables (e.g., the ones created using the clause `USING ORC`) when `spark.sql.orc.impl`
    +is set to `native` and `spark.sql.orc.enableVectorizedReader` is set to `true`. For the Hive ORC
    +serde tables (e.g., the ones created using the clause `USING HIVE OPTIONS (fileFormat 'ORC')`),
    +the vectorized reader is used when `spark.sql.hive.convertMetastoreOrc` is also set to `true`.
    +
    +<table class="table">
    +  <tr><th><b>Property Name</b></th><th><b>Default</b></th><th><b>Meaning</b></th></tr>
    +  <tr>
    +    <td><code>spark.sql.orc.impl</code></td>
    +    <td><code>native</code></td>
    +    <td>The name of ORC implementation. It can be one of <code>native</code> and <code>hive</code>. <code>native</code> means the native ORC support that is built on Apache ORC 1.4. `hive` means the ORC library in Hive 1.2.1.</td>
    +  </tr>
    +  <tr>
    +    <td><code>spark.sql.orc.enableVectorizedReader</code></td>
    +    <td><code>true</code></td>
    +    <td>Enables vectorized orc decoding in <code>native</code> implementation. If <code>false</code>, a new non-vectorized ORC reader is used in <code>native</code> implementation. For <code>hive</code> implementation, this is ignored.</td>
    +  </tr>
    +</table>
    +
    +## JSON Datasets
    --- End diff --
    
    Got it, will change it soon.


---



[GitHub] spark issue #22746: [SPARK-24499][SQL][DOC] Split the page of sql-programmin...

Posted by xuanyuanking <gi...@git.apache.org>.
Github user xuanyuanking commented on the issue:

    https://github.com/apache/spark/pull/22746
  
    @kiszk Many thanks for all the detailed checks; addressed in 17995f9. I also double-checked by grepping for each typo you found.


---



[GitHub] spark issue #22746: [SPARK-24499][SQL][DOC] Split the page of sql-programmin...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22746
  
    Merged build finished. Test PASSed.


---



[GitHub] spark pull request #22746: [SPARK-24499][SQL][DOC] Split the page of sql-pro...

Posted by xuanyuanking <gi...@git.apache.org>.
Github user xuanyuanking commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22746#discussion_r225794477
  
    --- Diff: docs/sql-getting-started.md ---
    @@ -0,0 +1,369 @@
    +---
    +layout: global
    +title: Getting Started
    +displayTitle: Getting Started
    +---
    +
    +* Table of contents
    +{:toc}
    +
    +## Starting Point: SparkSession
    +
    +<div class="codetabs">
    +<div data-lang="scala"  markdown="1">
    +
    +The entry point into all functionality in Spark is the [`SparkSession`](api/scala/index.html#org.apache.spark.sql.SparkSession) class. To create a basic `SparkSession`, just use `SparkSession.builder()`:
    +
    +{% include_example init_session scala/org/apache/spark/examples/sql/SparkSQLExample.scala %}
    +</div>
    +
    +<div data-lang="java" markdown="1">
    +
    +The entry point into all functionality in Spark is the [`SparkSession`](api/java/index.html#org.apache.spark.sql.SparkSession) class. To create a basic `SparkSession`, just use `SparkSession.builder()`:
    +
    +{% include_example init_session java/org/apache/spark/examples/sql/JavaSparkSQLExample.java %}
    +</div>
    +
    +<div data-lang="python"  markdown="1">
    +
    +The entry point into all functionality in Spark is the [`SparkSession`](api/python/pyspark.sql.html#pyspark.sql.SparkSession) class. To create a basic `SparkSession`, just use `SparkSession.builder`:
    +
    +{% include_example init_session python/sql/basic.py %}
    +</div>
    +
    +<div data-lang="r"  markdown="1">
    +
    +The entry point into all functionality in Spark is the [`SparkSession`](api/R/sparkR.session.html) class. To initialize a basic `SparkSession`, just call `sparkR.session()`:
    +
    +{% include_example init_session r/RSparkSQLExample.R %}
    +
    +Note that when invoked for the first time, `sparkR.session()` initializes a global `SparkSession` singleton instance, and always returns a reference to this instance for successive invocations. In this way, users only need to initialize the `SparkSession` once, then SparkR functions like `read.df` will be able to access this global instance implicitly, and users don't need to pass the `SparkSession` instance around.
    +</div>
    +</div>
    +
    +`SparkSession` in Spark 2.0 provides builtin support for Hive features including the ability to
    +write queries using HiveQL, access to Hive UDFs, and the ability to read data from Hive tables.
    +To use these features, you do not need to have an existing Hive setup.
    +
    +## Creating DataFrames
    +
    +<div class="codetabs">
    +<div data-lang="scala"  markdown="1">
    +With a `SparkSession`, applications can create DataFrames from an [existing `RDD`](#interoperating-with-rdds),
    +from a Hive table, or from [Spark data sources](#data-sources).
    --- End diff --
    
    Done in 58115e5; also fixed the links in ml-pipeline.md, sparkr.md and structured-streaming-programming-guide.md.
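
    For quick reference, a minimal Scala sketch of the `SparkSession` entry point discussed in this hunk (the application name is hypothetical; the JSON path follows the layout of the Spark examples):

        import org.apache.spark.sql.SparkSession

        object GettingStartedSketch {
          def main(args: Array[String]): Unit = {
            // SparkSession.builder() configures and creates (or reuses) the session,
            // the single entry point for DataFrame and SQL functionality.
            val spark = SparkSession.builder()
              .appName("GettingStartedSketch")
              .getOrCreate()

            // With a SparkSession, DataFrames can be created from Spark data sources.
            val df = spark.read.json("examples/src/main/resources/people.json")
            df.show()

            spark.stop()
          }
        }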


---



[GitHub] spark pull request #22746: [SPARK-24499][SQL][DOC] Split the page of sql-pro...

Posted by gengliangwang <gi...@git.apache.org>.
Github user gengliangwang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22746#discussion_r225921716
  
    --- Diff: docs/sql-reference.md ---
    @@ -0,0 +1,641 @@
    +---
    +layout: global
    +title: Reference
    +displayTitle: Reference
    +---
    +
    +* Table of contents
    +{:toc}
    +
    +## Data Types
    +
    +Spark SQL and DataFrames support the following data types:
    +
    +* Numeric types
    +  - `ByteType`: Represents 1-byte signed integer numbers.
    +  The range of numbers is from `-128` to `127`.
    +  - `ShortType`: Represents 2-byte signed integer numbers.
    +  The range of numbers is from `-32768` to `32767`.
    +  - `IntegerType`: Represents 4-byte signed integer numbers.
    +  The range of numbers is from `-2147483648` to `2147483647`.
    +  - `LongType`: Represents 8-byte signed integer numbers.
    +  The range of numbers is from `-9223372036854775808` to `9223372036854775807`.
    +  - `FloatType`: Represents 4-byte single-precision floating point numbers.
    +  - `DoubleType`: Represents 8-byte double-precision floating point numbers.
    +  - `DecimalType`: Represents arbitrary-precision signed decimal numbers. Backed internally by `java.math.BigDecimal`. A `BigDecimal` consists of an arbitrary precision integer unscaled value and a 32-bit integer scale.
    +* String type
    +  - `StringType`: Represents character string values.
    +* Binary type
    +  - `BinaryType`: Represents byte sequence values.
    +* Boolean type
    +  - `BooleanType`: Represents boolean values.
    +* Datetime type
    +  - `TimestampType`: Represents values comprising values of fields year, month, day,
    +  hour, minute, and second.
    +  - `DateType`: Represents values comprising values of fields year, month, day.
    +* Complex types
    +  - `ArrayType(elementType, containsNull)`: Represents values comprising a sequence of
    +  elements with the type of `elementType`. `containsNull` is used to indicate if
    +  elements in an `ArrayType` value can have `null` values.
    +  - `MapType(keyType, valueType, valueContainsNull)`:
    +  Represents values comprising a set of key-value pairs. The data type of keys is
    +  described by `keyType` and the data type of values is described by `valueType`.
    +  For a `MapType` value, keys are not allowed to have `null` values. `valueContainsNull`
    +  is used to indicate if values of a `MapType` value can have `null` values.
    +  - `StructType(fields)`: Represents values with the structure described by
    +  a sequence of `StructField`s (`fields`).
    +    * `StructField(name, dataType, nullable)`: Represents a field in a `StructType`.
    +    The name of a field is indicated by `name`. The data type of a field is indicated
    +    by `dataType`. `nullable` is used to indicate if values of this field can have
    +    `null` values.
    +
    +<div class="codetabs">
    +<div data-lang="scala"  markdown="1">
    +
    +All data types of Spark SQL are located in the package `org.apache.spark.sql.types`.
    +You can access them by doing
    +
    +{% include_example data_types scala/org/apache/spark/examples/sql/SparkSQLExample.scala %}
    +
    +<table class="table">
    +<tr>
    +  <th style="width:20%">Data type</th>
    +  <th style="width:40%">Value type in Scala</th>
    +  <th>API to access or create a data type</th></tr>
    +<tr>
    +  <td> <b>ByteType</b> </td>
    +  <td> Byte </td>
    +  <td>
    +  ByteType
    +  </td>
    +</tr>
    +<tr>
    +  <td> <b>ShortType</b> </td>
    +  <td> Short </td>
    +  <td>
    +  ShortType
    +  </td>
    +</tr>
    +<tr>
    +  <td> <b>IntegerType</b> </td>
    +  <td> Int </td>
    +  <td>
    +  IntegerType
    +  </td>
    +</tr>
    +<tr>
    +  <td> <b>LongType</b> </td>
    +  <td> Long </td>
    +  <td>
    +  LongType
    +  </td>
    +</tr>
    +<tr>
    +  <td> <b>FloatType</b> </td>
    +  <td> Float </td>
    +  <td>
    +  FloatType
    +  </td>
    +</tr>
    +<tr>
    +  <td> <b>DoubleType</b> </td>
    +  <td> Double </td>
    +  <td>
    +  DoubleType
    +  </td>
    +</tr>
    +<tr>
    +  <td> <b>DecimalType</b> </td>
    +  <td> java.math.BigDecimal </td>
    +  <td>
    +  DecimalType
    +  </td>
    +</tr>
    +<tr>
    +  <td> <b>StringType</b> </td>
    +  <td> String </td>
    +  <td>
    +  StringType
    +  </td>
    +</tr>
    +<tr>
    +  <td> <b>BinaryType</b> </td>
    +  <td> Array[Byte] </td>
    +  <td>
    +  BinaryType
    +  </td>
    +</tr>
    +<tr>
    +  <td> <b>BooleanType</b> </td>
    +  <td> Boolean </td>
    +  <td>
    +  BooleanType
    +  </td>
    +</tr>
    +<tr>
    +  <td> <b>TimestampType</b> </td>
    +  <td> java.sql.Timestamp </td>
    +  <td>
    +  TimestampType
    +  </td>
    +</tr>
    +<tr>
    +  <td> <b>DateType</b> </td>
    +  <td> java.sql.Date </td>
    +  <td>
    +  DateType
    +  </td>
    +</tr>
    +<tr>
    +  <td> <b>ArrayType</b> </td>
    +  <td> scala.collection.Seq </td>
    +  <td>
    +  ArrayType(<i>elementType</i>, [<i>containsNull</i>])<br />
    +  <b>Note:</b> The default value of <i>containsNull</i> is <i>true</i>.
    +  </td>
    +</tr>
    +<tr>
    +  <td> <b>MapType</b> </td>
    +  <td> scala.collection.Map </td>
    +  <td>
    +  MapType(<i>keyType</i>, <i>valueType</i>, [<i>valueContainsNull</i>])<br />
    +  <b>Note:</b> The default value of <i>valueContainsNull</i> is <i>true</i>.
    +  </td>
    +</tr>
    +<tr>
    +  <td> <b>StructType</b> </td>
    +  <td> org.apache.spark.sql.Row </td>
    +  <td>
    +  StructType(<i>fields</i>)<br />
    +  <b>Note:</b> <i>fields</i> is a Seq of StructFields. Also, two fields with the same
    +  name are not allowed.
    +  </td>
    +</tr>
    +<tr>
    +  <td> <b>StructField</b> </td>
    +  <td> The value type in Scala of the data type of this field
    +  (For example, Int for a StructField with the data type IntegerType) </td>
    +  <td>
    +  StructField(<i>name</i>, <i>dataType</i>, [<i>nullable</i>])<br />
    +  <b>Note:</b> The default value of <i>nullable</i> is <i>true</i>.
    +  </td>
    +</tr>
    +</table>
    +
    +</div>
    +
    +<div data-lang="java" markdown="1">
    +
    +All data types of Spark SQL are located in the package of
    +`org.apache.spark.sql.types`. To access or create a data type,
    +please use factory methods provided in
    +`org.apache.spark.sql.types.DataTypes`.
    +
    +<table class="table">
    +<tr>
    +  <th style="width:20%">Data type</th>
    +  <th style="width:40%">Value type in Java</th>
    +  <th>API to access or create a data type</th></tr>
    +<tr>
    +  <td> <b>ByteType</b> </td>
    +  <td> byte or Byte </td>
    +  <td>
    +  DataTypes.ByteType
    +  </td>
    +</tr>
    +<tr>
    +  <td> <b>ShortType</b> </td>
    +  <td> short or Short </td>
    +  <td>
    +  DataTypes.ShortType
    +  </td>
    +</tr>
    +<tr>
    +  <td> <b>IntegerType</b> </td>
    +  <td> int or Integer </td>
    +  <td>
    +  DataTypes.IntegerType
    +  </td>
    +</tr>
    +<tr>
    +  <td> <b>LongType</b> </td>
    +  <td> long or Long </td>
    +  <td>
    +  DataTypes.LongType
    +  </td>
    +</tr>
    +<tr>
    +  <td> <b>FloatType</b> </td>
    +  <td> float or Float </td>
    +  <td>
    +  DataTypes.FloatType
    +  </td>
    +</tr>
    +<tr>
    +  <td> <b>DoubleType</b> </td>
    +  <td> double or Double </td>
    +  <td>
    +  DataTypes.DoubleType
    +  </td>
    +</tr>
    +<tr>
    +  <td> <b>DecimalType</b> </td>
    +  <td> java.math.BigDecimal </td>
    +  <td>
    +  DataTypes.createDecimalType()<br />
    +  DataTypes.createDecimalType(<i>precision</i>, <i>scale</i>).
    +  </td>
    +</tr>
    +<tr>
    +  <td> <b>StringType</b> </td>
    +  <td> String </td>
    +  <td>
    +  DataTypes.StringType
    +  </td>
    +</tr>
    +<tr>
    +  <td> <b>BinaryType</b> </td>
    +  <td> byte[] </td>
    +  <td>
    +  DataTypes.BinaryType
    +  </td>
    +</tr>
    +<tr>
    +  <td> <b>BooleanType</b> </td>
    +  <td> boolean or Boolean </td>
    +  <td>
    +  DataTypes.BooleanType
    +  </td>
    +</tr>
    +<tr>
    +  <td> <b>TimestampType</b> </td>
    +  <td> java.sql.Timestamp </td>
    +  <td>
    +  DataTypes.TimestampType
    +  </td>
    +</tr>
    +<tr>
    +  <td> <b>DateType</b> </td>
    +  <td> java.sql.Date </td>
    +  <td>
    +  DataTypes.DateType
    +  </td>
    +</tr>
    +<tr>
    +  <td> <b>ArrayType</b> </td>
    +  <td> java.util.List </td>
    +  <td>
    +  DataTypes.createArrayType(<i>elementType</i>)<br />
    +  <b>Note:</b> The value of <i>containsNull</i> will be <i>true</i><br />
    +  DataTypes.createArrayType(<i>elementType</i>, <i>containsNull</i>).
    +  </td>
    +</tr>
    +<tr>
    +  <td> <b>MapType</b> </td>
    +  <td> java.util.Map </td>
    +  <td>
    +  DataTypes.createMapType(<i>keyType</i>, <i>valueType</i>)<br />
    +  <b>Note:</b> The value of <i>valueContainsNull</i> will be <i>true</i>.<br />
    +  DataTypes.createMapType(<i>keyType</i>, <i>valueType</i>, <i>valueContainsNull</i>)<br />
    +  </td>
    +</tr>
    +<tr>
    +  <td> <b>StructType</b> </td>
    +  <td> org.apache.spark.sql.Row </td>
    +  <td>
    +  DataTypes.createStructType(<i>fields</i>)<br />
    +  <b>Note:</b> <i>fields</i> is a List or an array of StructFields.
    +  Also, two fields with the same name are not allowed.
    +  </td>
    +</tr>
    +<tr>
    +  <td> <b>StructField</b> </td>
    +  <td> The value type in Java of the data type of this field
    +  (For example, int for a StructField with the data type IntegerType) </td>
    +  <td>
    +  DataTypes.createStructField(<i>name</i>, <i>dataType</i>, <i>nullable</i>)
    +  </td>
    +</tr>
    +</table>
    +
    +</div>
    +
    +<div data-lang="python"  markdown="1">
    +
    +All data types of Spark SQL are located in the package of `pyspark.sql.types`.
    +You can access them by doing
    +{% highlight python %}
    +from pyspark.sql.types import *
    +{% endhighlight %}
    +
    +<table class="table">
    +<tr>
    +  <th style="width:20%">Data type</th>
    +  <th style="width:40%">Value type in Python</th>
    +  <th>API to access or create a data type</th></tr>
    +<tr>
    +  <td> <b>ByteType</b> </td>
    +  <td>
    +  int or long <br />
    +  <b>Note:</b> Numbers will be converted to 1-byte signed integer numbers at runtime.
    +  Please make sure that numbers are within the range of -128 to 127.
    +  </td>
    +  <td>
    +  ByteType()
    +  </td>
    +</tr>
    +<tr>
    +  <td> <b>ShortType</b> </td>
    +  <td>
    +  int or long <br />
    +  <b>Note:</b> Numbers will be converted to 2-byte signed integer numbers at runtime.
    +  Please make sure that numbers are within the range of -32768 to 32767.
    +  </td>
    +  <td>
    +  ShortType()
    +  </td>
    +</tr>
    +<tr>
    +  <td> <b>IntegerType</b> </td>
    +  <td> int or long </td>
    +  <td>
    +  IntegerType()
    +  </td>
    +</tr>
    +<tr>
    +  <td> <b>LongType</b> </td>
    +  <td>
    +  long <br />
    +  <b>Note:</b> Numbers will be converted to 8-byte signed integer numbers at runtime.
    +  Please make sure that numbers are within the range of
    +  -9223372036854775808 to 9223372036854775807.
    +  Otherwise, please convert data to decimal.Decimal and use DecimalType.
    +  </td>
    +  <td>
    +  LongType()
    +  </td>
    +</tr>
    +<tr>
    +  <td> <b>FloatType</b> </td>
    +  <td>
    +  float <br />
    +  <b>Note:</b> Numbers will be converted to 4-byte single-precision floating
    +  point numbers at runtime.
    +  </td>
    +  <td>
    +  FloatType()
    +  </td>
    +</tr>
    +<tr>
    +  <td> <b>DoubleType</b> </td>
    +  <td> float </td>
    +  <td>
    +  DoubleType()
    +  </td>
    +</tr>
    +<tr>
    +  <td> <b>DecimalType</b> </td>
    +  <td> decimal.Decimal </td>
    +  <td>
    +  DecimalType()
    +  </td>
    +</tr>
    +<tr>
    +  <td> <b>StringType</b> </td>
    +  <td> string </td>
    +  <td>
    +  StringType()
    +  </td>
    +</tr>
    +<tr>
    +  <td> <b>BinaryType</b> </td>
    +  <td> bytearray </td>
    +  <td>
    +  BinaryType()
    +  </td>
    +</tr>
    +<tr>
    +  <td> <b>BooleanType</b> </td>
    +  <td> bool </td>
    +  <td>
    +  BooleanType()
    +  </td>
    +</tr>
    +<tr>
    +  <td> <b>TimestampType</b> </td>
    +  <td> datetime.datetime </td>
    +  <td>
    +  TimestampType()
    +  </td>
    +</tr>
    +<tr>
    +  <td> <b>DateType</b> </td>
    +  <td> datetime.date </td>
    +  <td>
    +  DateType()
    +  </td>
    +</tr>
    +<tr>
    +  <td> <b>ArrayType</b> </td>
    +  <td> list, tuple, or array </td>
    +  <td>
    +  ArrayType(<i>elementType</i>, [<i>containsNull</i>])<br />
    +  <b>Note:</b> The default value of <i>containsNull</i> is <i>True</i>.
    +  </td>
    +</tr>
    +<tr>
    +  <td> <b>MapType</b> </td>
    +  <td> dict </td>
    +  <td>
    +  MapType(<i>keyType</i>, <i>valueType</i>, [<i>valueContainsNull</i>])<br />
    +  <b>Note:</b> The default value of <i>valueContainsNull</i> is <i>True</i>.
    +  </td>
    +</tr>
    +<tr>
    +  <td> <b>StructType</b> </td>
    +  <td> list or tuple </td>
    +  <td>
    +  StructType(<i>fields</i>)<br />
    +  <b>Note:</b> <i>fields</i> is a Seq of StructFields. Also, two fields with the same
    +  name are not allowed.
    +  </td>
    +</tr>
    +<tr>
    +  <td> <b>StructField</b> </td>
    +  <td> The value type in Python of the data type of this field
    +  (For example, Int for a StructField with the data type IntegerType) </td>
    +  <td>
    +  StructField(<i>name</i>, <i>dataType</i>, [<i>nullable</i>])<br />
    +  <b>Note:</b> The default value of <i>nullable</i> is <i>True</i>.
    +  </td>
    +</tr>
    +</table>
    +
    +</div>
    +
    +<div data-lang="r"  markdown="1">
    +
    +<table class="table">
    +<tr>
    +  <th style="width:20%">Data type</th>
    +  <th style="width:40%">Value type in R</th>
    +  <th>API to access or create a data type</th></tr>
    +<tr>
    +  <td> <b>ByteType</b> </td>
    +  <td>
    +  integer <br />
    +  <b>Note:</b> Numbers will be converted to 1-byte signed integer numbers at runtime.
    +  Please make sure that numbers are within the range of -128 to 127.
    +  </td>
    +  <td>
    +  "byte"
    +  </td>
    +</tr>
    +<tr>
    +  <td> <b>ShortType</b> </td>
    +  <td>
    +  integer <br />
    +  <b>Note:</b> Numbers will be converted to 2-byte signed integer numbers at runtime.
    +  Please make sure that numbers are within the range of -32768 to 32767.
    +  </td>
    +  <td>
    +  "short"
    +  </td>
    +</tr>
    +<tr>
    +  <td> <b>IntegerType</b> </td>
    +  <td> integer </td>
    +  <td>
    +  "integer"
    +  </td>
    +</tr>
    +<tr>
    +  <td> <b>LongType</b> </td>
    +  <td>
    +  integer <br />
    +  <b>Note:</b> Numbers will be converted to 8-byte signed integer numbers at runtime.
    +  Please make sure that numbers are within the range of
    +  -9223372036854775808 to 9223372036854775807.
    +  Otherwise, please convert data to decimal.Decimal and use DecimalType.
    +  </td>
    +  <td>
    +  "long"
    +  </td>
    +</tr>
    +<tr>
    +  <td> <b>FloatType</b> </td>
    +  <td>
    +  numeric <br />
    +  <b>Note:</b> Numbers will be converted to 4-byte single-precision floating
    +  point numbers at runtime.
    +  </td>
    +  <td>
    +  "float"
    +  </td>
    +</tr>
    +<tr>
    +  <td> <b>DoubleType</b> </td>
    +  <td> numeric </td>
    +  <td>
    +  "double"
    +  </td>
    +</tr>
    +<tr>
    +  <td> <b>DecimalType</b> </td>
    +  <td> Not supported </td>
    +  <td>
    +   Not supported
    +  </td>
    +</tr>
    +<tr>
    +  <td> <b>StringType</b> </td>
    +  <td> character </td>
    +  <td>
    +  "string"
    +  </td>
    +</tr>
    +<tr>
    +  <td> <b>BinaryType</b> </td>
    +  <td> raw </td>
    +  <td>
    +  "binary"
    +  </td>
    +</tr>
    +<tr>
    +  <td> <b>BooleanType</b> </td>
    +  <td> logical </td>
    +  <td>
    +  "bool"
    +  </td>
    +</tr>
    +<tr>
    +  <td> <b>TimestampType</b> </td>
    +  <td> POSIXct </td>
    +  <td>
    +  "timestamp"
    +  </td>
    +</tr>
    +<tr>
    +  <td> <b>DateType</b> </td>
    +  <td> Date </td>
    +  <td>
    +  "date"
    +  </td>
    +</tr>
    +<tr>
    +  <td> <b>ArrayType</b> </td>
    +  <td> vector or list </td>
    +  <td>
    +  list(type="array", elementType=<i>elementType</i>, containsNull=[<i>containsNull</i>])<br />
    +  <b>Note:</b> The default value of <i>containsNull</i> is <i>TRUE</i>.
    +  </td>
    +</tr>
    +<tr>
    +  <td> <b>MapType</b> </td>
    +  <td> environment </td>
    +  <td>
    +  list(type="map", keyType=<i>keyType</i>, valueType=<i>valueType</i>, valueContainsNull=[<i>valueContainsNull</i>])<br />
    +  <b>Note:</b> The default value of <i>valueContainsNull</i> is <i>TRUE</i>.
    +  </td>
    +</tr>
    +<tr>
    +  <td> <b>StructType</b> </td>
    +  <td> named list</td>
    +  <td>
    +  list(type="struct", fields=<i>fields</i>)<br />
    +  <b>Note:</b> <i>fields</i> is a Seq of StructFields. Also, two fields with the same
    +  name are not allowed.
    +  </td>
    +</tr>
    +<tr>
    +  <td> <b>StructField</b> </td>
    +  <td> The value type in R of the data type of this field
    +  (For example, integer for a StructField with the data type IntegerType) </td>
    +  <td>
    +  list(name=<i>name</i>, type=<i>dataType</i>, nullable=[<i>nullable</i>])<br />
    +  <b>Note:</b> The default value of <i>nullable</i> is <i>TRUE</i>.
    +  </td>
    +</tr>
    +</table>
    +
    +</div>
    +
    +</div>
    +
    +## NaN Semantics
    +
    +There is special handling for not-a-number (NaN) when dealing with `float` or `double` types that
    +does not exactly match standard floating point semantics.
    +Specifically:
    +
    + - NaN = NaN returns true.
    + - In aggregations, all NaN values are grouped together.
    + - NaN is treated as a normal value in join keys.
    + - NaN values go last when in ascending order, larger than any other numeric value.
    + 
    + ## Arithmetic operations
    --- End diff --
    
    The space indent here is wrong.
    
    ![image](https://user-images.githubusercontent.com/1097932/47088617-7f04b900-d251-11e8-82db-c8762f80b9a7.png)
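
    To complement the type tables in this hunk, a minimal Scala sketch of building a schema from the listed types (the field names are hypothetical):

        import org.apache.spark.sql.types._

        // A schema mixing simple and complex types: a non-nullable string,
        // a nullable integer, an array of doubles and a string-to-long map.
        val personSchema = StructType(Seq(
          StructField("name", StringType, nullable = false),
          StructField("age", IntegerType, nullable = true),
          StructField("scores", ArrayType(DoubleType, containsNull = true)),
          StructField("counts", MapType(StringType, LongType, valueContainsNull = true))
        ))

        println(personSchema.treeString)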



---



[GitHub] spark issue #22746: [SPARK-24499][SQL][DOC] Split the page of sql-programmin...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22746
  
    **[Test build #97482 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97482/testReport)** for PR 22746 at commit [`58115e5`](https://github.com/apache/spark/commit/58115e5a69670f45cf05d2026cb57abb595fe073).


---



[GitHub] spark pull request #22746: [SPARK-24499][SQL][DOC] Split the page of sql-pro...

Posted by xuanyuanking <gi...@git.apache.org>.
Github user xuanyuanking commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22746#discussion_r226011439
  
    --- Diff: docs/sql-reference.md ---
    @@ -0,0 +1,641 @@
    +---
    +layout: global
    +title: Reference
    +displayTitle: Reference
    +---
    +
    +* Table of contents
    +{:toc}
    +
    +## Data Types
    +
    +Spark SQL and DataFrames support the following data types:
    +
    +* Numeric types
    +  - `ByteType`: Represents 1-byte signed integer numbers.
    +  The range of numbers is from `-128` to `127`.
    +  - `ShortType`: Represents 2-byte signed integer numbers.
    +  The range of numbers is from `-32768` to `32767`.
    +  - `IntegerType`: Represents 4-byte signed integer numbers.
    +  The range of numbers is from `-2147483648` to `2147483647`.
    +  - `LongType`: Represents 8-byte signed integer numbers.
    +  The range of numbers is from `-9223372036854775808` to `9223372036854775807`.
    +  - `FloatType`: Represents 4-byte single-precision floating point numbers.
    +  - `DoubleType`: Represents 8-byte double-precision floating point numbers.
    +  - `DecimalType`: Represents arbitrary-precision signed decimal numbers. Backed internally by `java.math.BigDecimal`. A `BigDecimal` consists of an arbitrary precision integer unscaled value and a 32-bit integer scale.
    +* String type
    +  - `StringType`: Represents character string values.
    +* Binary type
    +  - `BinaryType`: Represents byte sequence values.
    +* Boolean type
    +  - `BooleanType`: Represents boolean values.
    +* Datetime type
    +  - `TimestampType`: Represents values comprising values of fields year, month, day,
    +  hour, minute, and second.
    +  - `DateType`: Represents values comprising values of fields year, month, day.
    +* Complex types
    +  - `ArrayType(elementType, containsNull)`: Represents values comprising a sequence of
    +  elements with the type of `elementType`. `containsNull` is used to indicate if
    +  elements in an `ArrayType` value can have `null` values.
    +  - `MapType(keyType, valueType, valueContainsNull)`:
    +  Represents values comprising a set of key-value pairs. The data type of keys is
    +  described by `keyType` and the data type of values is described by `valueType`.
    +  For a `MapType` value, keys are not allowed to have `null` values. `valueContainsNull`
    +  is used to indicate if values of a `MapType` value can have `null` values.
    +  - `StructType(fields)`: Represents values with the structure described by
    +  a sequence of `StructField`s (`fields`).
    +    * `StructField(name, dataType, nullable)`: Represents a field in a `StructType`.
    +    The name of a field is indicated by `name`. The data type of a field is indicated
    +    by `dataType`. `nullable` is used to indicate if values of this field can have
    +    `null` values.
    +
    +<div class="codetabs">
    +<div data-lang="scala"  markdown="1">
    +
    +All data types of Spark SQL are located in the package `org.apache.spark.sql.types`.
    +You can access them by doing
    +
    +{% include_example data_types scala/org/apache/spark/examples/sql/SparkSQLExample.scala %}
    +
    +<table class="table">
    +<tr>
    +  <th style="width:20%">Data type</th>
    +  <th style="width:40%">Value type in Scala</th>
    +  <th>API to access or create a data type</th></tr>
    +<tr>
    +  <td> <b>ByteType</b> </td>
    +  <td> Byte </td>
    +  <td>
    +  ByteType
    +  </td>
    +</tr>
    +<tr>
    +  <td> <b>ShortType</b> </td>
    +  <td> Short </td>
    +  <td>
    +  ShortType
    +  </td>
    +</tr>
    +<tr>
    +  <td> <b>IntegerType</b> </td>
    +  <td> Int </td>
    +  <td>
    +  IntegerType
    +  </td>
    +</tr>
    +<tr>
    +  <td> <b>LongType</b> </td>
    +  <td> Long </td>
    +  <td>
    +  LongType
    +  </td>
    +</tr>
    +<tr>
    +  <td> <b>FloatType</b> </td>
    +  <td> Float </td>
    +  <td>
    +  FloatType
    +  </td>
    +</tr>
    +<tr>
    +  <td> <b>DoubleType</b> </td>
    +  <td> Double </td>
    +  <td>
    +  DoubleType
    +  </td>
    +</tr>
    +<tr>
    +  <td> <b>DecimalType</b> </td>
    +  <td> java.math.BigDecimal </td>
    +  <td>
    +  DecimalType
    +  </td>
    +</tr>
    +<tr>
    +  <td> <b>StringType</b> </td>
    +  <td> String </td>
    +  <td>
    +  StringType
    +  </td>
    +</tr>
    +<tr>
    +  <td> <b>BinaryType</b> </td>
    +  <td> Array[Byte] </td>
    +  <td>
    +  BinaryType
    +  </td>
    +</tr>
    +<tr>
    +  <td> <b>BooleanType</b> </td>
    +  <td> Boolean </td>
    +  <td>
    +  BooleanType
    +  </td>
    +</tr>
    +<tr>
    +  <td> <b>TimestampType</b> </td>
    +  <td> java.sql.Timestamp </td>
    +  <td>
    +  TimestampType
    +  </td>
    +</tr>
    +<tr>
    +  <td> <b>DateType</b> </td>
    +  <td> java.sql.Date </td>
    +  <td>
    +  DateType
    +  </td>
    +</tr>
    +<tr>
    +  <td> <b>ArrayType</b> </td>
    +  <td> scala.collection.Seq </td>
    +  <td>
    +  ArrayType(<i>elementType</i>, [<i>containsNull</i>])<br />
    +  <b>Note:</b> The default value of <i>containsNull</i> is <i>true</i>.
    +  </td>
    +</tr>
    +<tr>
    +  <td> <b>MapType</b> </td>
    +  <td> scala.collection.Map </td>
    +  <td>
    +  MapType(<i>keyType</i>, <i>valueType</i>, [<i>valueContainsNull</i>])<br />
    +  <b>Note:</b> The default value of <i>valueContainsNull</i> is <i>true</i>.
    +  </td>
    +</tr>
    +<tr>
    +  <td> <b>StructType</b> </td>
    +  <td> org.apache.spark.sql.Row </td>
    +  <td>
    +  StructType(<i>fields</i>)<br />
    +  <b>Note:</b> <i>fields</i> is a Seq of StructFields. Also, two fields with the same
    +  name are not allowed.
    +  </td>
    +</tr>
    +<tr>
    +  <td> <b>StructField</b> </td>
    +  <td> The value type in Scala of the data type of this field
    +  (For example, Int for a StructField with the data type IntegerType) </td>
    +  <td>
    +  StructField(<i>name</i>, <i>dataType</i>, [<i>nullable</i>])<br />
    +  <b>Note:</b> The default value of <i>nullable</i> is <i>true</i>.
    +  </td>
    +</tr>
    +</table>
    +
    +</div>
    +
    +<div data-lang="java" markdown="1">
    +
    +All data types of Spark SQL are located in the package of
    +`org.apache.spark.sql.types`. To access or create a data type,
    +please use factory methods provided in
    +`org.apache.spark.sql.types.DataTypes`.
    +
    +<table class="table">
    +<tr>
    +  <th style="width:20%">Data type</th>
    +  <th style="width:40%">Value type in Java</th>
    +  <th>API to access or create a data type</th></tr>
    +<tr>
    +  <td> <b>ByteType</b> </td>
    +  <td> byte or Byte </td>
    +  <td>
    +  DataTypes.ByteType
    +  </td>
    +</tr>
    +<tr>
    +  <td> <b>ShortType</b> </td>
    +  <td> short or Short </td>
    +  <td>
    +  DataTypes.ShortType
    +  </td>
    +</tr>
    +<tr>
    +  <td> <b>IntegerType</b> </td>
    +  <td> int or Integer </td>
    +  <td>
    +  DataTypes.IntegerType
    +  </td>
    +</tr>
    +<tr>
    +  <td> <b>LongType</b> </td>
    +  <td> long or Long </td>
    +  <td>
    +  DataTypes.LongType
    +  </td>
    +</tr>
    +<tr>
    +  <td> <b>FloatType</b> </td>
    +  <td> float or Float </td>
    +  <td>
    +  DataTypes.FloatType
    +  </td>
    +</tr>
    +<tr>
    +  <td> <b>DoubleType</b> </td>
    +  <td> double or Double </td>
    +  <td>
    +  DataTypes.DoubleType
    +  </td>
    +</tr>
    +<tr>
    +  <td> <b>DecimalType</b> </td>
    +  <td> java.math.BigDecimal </td>
    +  <td>
    +  DataTypes.createDecimalType()<br />
    +  DataTypes.createDecimalType(<i>precision</i>, <i>scale</i>).
    +  </td>
    +</tr>
    +<tr>
    +  <td> <b>StringType</b> </td>
    +  <td> String </td>
    +  <td>
    +  DataTypes.StringType
    +  </td>
    +</tr>
    +<tr>
    +  <td> <b>BinaryType</b> </td>
    +  <td> byte[] </td>
    +  <td>
    +  DataTypes.BinaryType
    +  </td>
    +</tr>
    +<tr>
    +  <td> <b>BooleanType</b> </td>
    +  <td> boolean or Boolean </td>
    +  <td>
    +  DataTypes.BooleanType
    +  </td>
    +</tr>
    +<tr>
    +  <td> <b>TimestampType</b> </td>
    +  <td> java.sql.Timestamp </td>
    +  <td>
    +  DataTypes.TimestampType
    +  </td>
    +</tr>
    +<tr>
    +  <td> <b>DateType</b> </td>
    +  <td> java.sql.Date </td>
    +  <td>
    +  DataTypes.DateType
    +  </td>
    +</tr>
    +<tr>
    +  <td> <b>ArrayType</b> </td>
    +  <td> java.util.List </td>
    +  <td>
    +  DataTypes.createArrayType(<i>elementType</i>)<br />
    +  <b>Note:</b> The value of <i>containsNull</i> will be <i>true</i><br />
    +  DataTypes.createArrayType(<i>elementType</i>, <i>containsNull</i>).
    +  </td>
    +</tr>
    +<tr>
    +  <td> <b>MapType</b> </td>
    +  <td> java.util.Map </td>
    +  <td>
    +  DataTypes.createMapType(<i>keyType</i>, <i>valueType</i>)<br />
    +  <b>Note:</b> The value of <i>valueContainsNull</i> will be <i>true</i>.<br />
    +  DataTypes.createMapType(<i>keyType</i>, <i>valueType</i>, <i>valueContainsNull</i>)<br />
    +  </td>
    +</tr>
    +<tr>
    +  <td> <b>StructType</b> </td>
    +  <td> org.apache.spark.sql.Row </td>
    +  <td>
    +  DataTypes.createStructType(<i>fields</i>)<br />
    +  <b>Note:</b> <i>fields</i> is a List or an array of StructFields.
    +  Also, two fields with the same name are not allowed.
    +  </td>
    +</tr>
    +<tr>
    +  <td> <b>StructField</b> </td>
    +  <td> The value type in Java of the data type of this field
    +  (For example, int for a StructField with the data type IntegerType) </td>
    +  <td>
    +  DataTypes.createStructField(<i>name</i>, <i>dataType</i>, <i>nullable</i>)
    +  </td>
    +</tr>
    +</table>
    +
    +</div>
    +
    +<div data-lang="python"  markdown="1">
    +
    +All data types of Spark SQL are located in the package of `pyspark.sql.types`.
    +You can access them by doing
    +{% highlight python %}
    +from pyspark.sql.types import *
    +{% endhighlight %}
    +
    +<table class="table">
    +<tr>
    +  <th style="width:20%">Data type</th>
    +  <th style="width:40%">Value type in Python</th>
    +  <th>API to access or create a data type</th></tr>
    +<tr>
    +  <td> <b>ByteType</b> </td>
    +  <td>
    +  int or long <br />
    +  <b>Note:</b> Numbers will be converted to 1-byte signed integer numbers at runtime.
    +  Please make sure that numbers are within the range of -128 to 127.
    +  </td>
    +  <td>
    +  ByteType()
    +  </td>
    +</tr>
    +<tr>
    +  <td> <b>ShortType</b> </td>
    +  <td>
    +  int or long <br />
    +  <b>Note:</b> Numbers will be converted to 2-byte signed integer numbers at runtime.
    +  Please make sure that numbers are within the range of -32768 to 32767.
    +  </td>
    +  <td>
    +  ShortType()
    +  </td>
    +</tr>
    +<tr>
    +  <td> <b>IntegerType</b> </td>
    +  <td> int or long </td>
    +  <td>
    +  IntegerType()
    +  </td>
    +</tr>
    +<tr>
    +  <td> <b>LongType</b> </td>
    +  <td>
    +  long <br />
    +  <b>Note:</b> Numbers will be converted to 8-byte signed integer numbers at runtime.
    +  Please make sure that numbers are within the range of
    +  -9223372036854775808 to 9223372036854775807.
    +  Otherwise, please convert data to decimal.Decimal and use DecimalType.
    +  </td>
    +  <td>
    +  LongType()
    +  </td>
    +</tr>
    +<tr>
    +  <td> <b>FloatType</b> </td>
    +  <td>
    +  float <br />
    +  <b>Note:</b> Numbers will be converted to 4-byte single-precision floating
    +  point numbers at runtime.
    +  </td>
    +  <td>
    +  FloatType()
    +  </td>
    +</tr>
    +<tr>
    +  <td> <b>DoubleType</b> </td>
    +  <td> float </td>
    +  <td>
    +  DoubleType()
    +  </td>
    +</tr>
    +<tr>
    +  <td> <b>DecimalType</b> </td>
    +  <td> decimal.Decimal </td>
    +  <td>
    +  DecimalType()
    +  </td>
    +</tr>
    +<tr>
    +  <td> <b>StringType</b> </td>
    +  <td> string </td>
    +  <td>
    +  StringType()
    +  </td>
    +</tr>
    +<tr>
    +  <td> <b>BinaryType</b> </td>
    +  <td> bytearray </td>
    +  <td>
    +  BinaryType()
    +  </td>
    +</tr>
    +<tr>
    +  <td> <b>BooleanType</b> </td>
    +  <td> bool </td>
    +  <td>
    +  BooleanType()
    +  </td>
    +</tr>
    +<tr>
    +  <td> <b>TimestampType</b> </td>
    +  <td> datetime.datetime </td>
    +  <td>
    +  TimestampType()
    +  </td>
    +</tr>
    +<tr>
    +  <td> <b>DateType</b> </td>
    +  <td> datetime.date </td>
    +  <td>
    +  DateType()
    +  </td>
    +</tr>
    +<tr>
    +  <td> <b>ArrayType</b> </td>
    +  <td> list, tuple, or array </td>
    +  <td>
    +  ArrayType(<i>elementType</i>, [<i>containsNull</i>])<br />
    +  <b>Note:</b> The default value of <i>containsNull</i> is <i>True</i>.
    +  </td>
    +</tr>
    +<tr>
    +  <td> <b>MapType</b> </td>
    +  <td> dict </td>
    +  <td>
    +  MapType(<i>keyType</i>, <i>valueType</i>, [<i>valueContainsNull</i>])<br />
    +  <b>Note:</b> The default value of <i>valueContainsNull</i> is <i>True</i>.
    +  </td>
    +</tr>
    +<tr>
    +  <td> <b>StructType</b> </td>
    +  <td> list or tuple </td>
    +  <td>
    +  StructType(<i>fields</i>)<br />
    +  <b>Note:</b> <i>fields</i> is a Seq of StructFields. Also, two fields with the same
    +  name are not allowed.
    +  </td>
    +</tr>
    +<tr>
    +  <td> <b>StructField</b> </td>
    +  <td> The value type in Python of the data type of this field
    +  (For example, Int for a StructField with the data type IntegerType) </td>
    +  <td>
    +  StructField(<i>name</i>, <i>dataType</i>, [<i>nullable</i>])<br />
    +  <b>Note:</b> The default value of <i>nullable</i> is <i>True</i>.
    +  </td>
    +</tr>
    +</table>
    +
    +</div>
    +
    +<div data-lang="r"  markdown="1">
    +
    +<table class="table">
    +<tr>
    +  <th style="width:20%">Data type</th>
    +  <th style="width:40%">Value type in R</th>
    +  <th>API to access or create a data type</th></tr>
    +<tr>
    +  <td> <b>ByteType</b> </td>
    +  <td>
    +  integer <br />
    +  <b>Note:</b> Numbers will be converted to 1-byte signed integer numbers at runtime.
    +  Please make sure that numbers are within the range of -128 to 127.
    +  </td>
    +  <td>
    +  "byte"
    +  </td>
    +</tr>
    +<tr>
    +  <td> <b>ShortType</b> </td>
    +  <td>
    +  integer <br />
    +  <b>Note:</b> Numbers will be converted to 2-byte signed integer numbers at runtime.
    +  Please make sure that numbers are within the range of -32768 to 32767.
    +  </td>
    +  <td>
    +  "short"
    +  </td>
    +</tr>
    +<tr>
    +  <td> <b>IntegerType</b> </td>
    +  <td> integer </td>
    +  <td>
    +  "integer"
    +  </td>
    +</tr>
    +<tr>
    +  <td> <b>LongType</b> </td>
    +  <td>
    +  integer <br />
    +  <b>Note:</b> Numbers will be converted to 8-byte signed integer numbers at runtime.
    +  Please make sure that numbers are within the range of
    +  -9223372036854775808 to 9223372036854775807.
    +  Otherwise, please convert data to decimal.Decimal and use DecimalType.
    +  </td>
    +  <td>
    +  "long"
    +  </td>
    +</tr>
    +<tr>
    +  <td> <b>FloatType</b> </td>
    +  <td>
    +  numeric <br />
    +  <b>Note:</b> Numbers will be converted to 4-byte single-precision floating
    +  point numbers at runtime.
    +  </td>
    +  <td>
    +  "float"
    +  </td>
    +</tr>
    +<tr>
    +  <td> <b>DoubleType</b> </td>
    +  <td> numeric </td>
    +  <td>
    +  "double"
    +  </td>
    +</tr>
    +<tr>
    +  <td> <b>DecimalType</b> </td>
    +  <td> Not supported </td>
    +  <td>
    +   Not supported
    +  </td>
    +</tr>
    +<tr>
    +  <td> <b>StringType</b> </td>
    +  <td> character </td>
    +  <td>
    +  "string"
    +  </td>
    +</tr>
    +<tr>
    +  <td> <b>BinaryType</b> </td>
    +  <td> raw </td>
    +  <td>
    +  "binary"
    +  </td>
    +</tr>
    +<tr>
    +  <td> <b>BooleanType</b> </td>
    +  <td> logical </td>
    +  <td>
    +  "bool"
    +  </td>
    +</tr>
    +<tr>
    +  <td> <b>TimestampType</b> </td>
    +  <td> POSIXct </td>
    +  <td>
    +  "timestamp"
    +  </td>
    +</tr>
    +<tr>
    +  <td> <b>DateType</b> </td>
    +  <td> Date </td>
    +  <td>
    +  "date"
    +  </td>
    +</tr>
    +<tr>
    +  <td> <b>ArrayType</b> </td>
    +  <td> vector or list </td>
    +  <td>
    +  list(type="array", elementType=<i>elementType</i>, containsNull=[<i>containsNull</i>])<br />
    +  <b>Note:</b> The default value of <i>containsNull</i> is <i>TRUE</i>.
    +  </td>
    +</tr>
    +<tr>
    +  <td> <b>MapType</b> </td>
    +  <td> environment </td>
    +  <td>
    +  list(type="map", keyType=<i>keyType</i>, valueType=<i>valueType</i>, valueContainsNull=[<i>valueContainsNull</i>])<br />
    +  <b>Note:</b> The default value of <i>valueContainsNull</i> is <i>TRUE</i>.
    +  </td>
    +</tr>
    +<tr>
    +  <td> <b>StructType</b> </td>
    +  <td> named list</td>
    +  <td>
    +  list(type="struct", fields=<i>fields</i>)<br />
    +  <b>Note:</b> <i>fields</i> is a Seq of StructFields. Also, two fields with the same
    +  name are not allowed.
    +  </td>
    +</tr>
    +<tr>
    +  <td> <b>StructField</b> </td>
    +  <td> The value type in R of the data type of this field
    +  (For example, integer for a StructField with the data type IntegerType) </td>
    +  <td>
    +  list(name=<i>name</i>, type=<i>dataType</i>, nullable=[<i>nullable</i>])<br />
    +  <b>Note:</b> The default value of <i>nullable</i> is <i>TRUE</i>.
    +  </td>
    +</tr>
    +</table>
    +
    +</div>
    +
    +</div>
    +
    +## NaN Semantics
    +
    +There is special handling for not-a-number (NaN) when dealing with `float` or `double` types that
    +does not exactly match standard floating point semantics.
    +Specifically:
    +
    + - NaN = NaN returns true.
    + - In aggregations, all NaN values are grouped together.
    + - NaN is treated as a normal value in join keys.
    + - NaN values go last when in ascending order, larger than any other numeric value.
    + 
    + ## Arithmetic operations
    --- End diff --
    
    Ah, thanks! Fixed in b3fc39d.
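
    As an aside, a minimal Scala sketch illustrating the NaN semantics listed in this hunk (the column name and values are hypothetical):

        import org.apache.spark.sql.SparkSession

        val spark = SparkSession.builder().appName("NaNSemanticsSketch").getOrCreate()
        import spark.implicits._

        val df = Seq(1.0, Double.NaN, 2.0, Double.NaN).toDF("v")

        // In aggregations, all NaN values are grouped together, so NaN appears once.
        df.groupBy("v").count().show()

        // In ascending order, NaN sorts after every other numeric value.
        df.orderBy("v").show()

        // NaN = NaN evaluates to true, unlike standard floating point semantics.
        spark.sql("SELECT CAST('NaN' AS DOUBLE) = CAST('NaN' AS DOUBLE) AS nan_eq").show()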


---



[GitHub] spark pull request #22746: [SPARK-24499][SQL][DOC] Split the page of sql-pro...

Posted by xuanyuanking <gi...@git.apache.org>.
Github user xuanyuanking commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22746#discussion_r225789933
  
    --- Diff: docs/sql-getting-started.md ---
    @@ -0,0 +1,369 @@
    +---
    +layout: global
    +title: Getting Started
    +displayTitle: Getting Started
    +---
    +
    +* Table of contents
    +{:toc}
    +
    +## Starting Point: SparkSession
    +
    +<div class="codetabs">
    +<div data-lang="scala"  markdown="1">
    +
    +The entry point into all functionality in Spark is the [`SparkSession`](api/scala/index.html#org.apache.spark.sql.SparkSession) class. To create a basic `SparkSession`, just use `SparkSession.builder()`:
    +
    +{% include_example init_session scala/org/apache/spark/examples/sql/SparkSQLExample.scala %}
    +</div>
    +
    +<div data-lang="java" markdown="1">
    +
    +The entry point into all functionality in Spark is the [`SparkSession`](api/java/index.html#org.apache.spark.sql.SparkSession) class. To create a basic `SparkSession`, just use `SparkSession.builder()`:
    +
    +{% include_example init_session java/org/apache/spark/examples/sql/JavaSparkSQLExample.java %}
    +</div>
    +
    +<div data-lang="python"  markdown="1">
    +
    +The entry point into all functionality in Spark is the [`SparkSession`](api/python/pyspark.sql.html#pyspark.sql.SparkSession) class. To create a basic `SparkSession`, just use `SparkSession.builder`:
    +
    +{% include_example init_session python/sql/basic.py %}
    +</div>
    +
    +<div data-lang="r"  markdown="1">
    +
    +The entry point into all functionality in Spark is the [`SparkSession`](api/R/sparkR.session.html) class. To initialize a basic `SparkSession`, just call `sparkR.session()`:
    +
    +{% include_example init_session r/RSparkSQLExample.R %}
    +
    +Note that when invoked for the first time, `sparkR.session()` initializes a global `SparkSession` singleton instance, and always returns a reference to this instance for successive invocations. In this way, users only need to initialize the `SparkSession` once, then SparkR functions like `read.df` will be able to access this global instance implicitly, and users don't need to pass the `SparkSession` instance around.
    +</div>
    +</div>
    +
    +`SparkSession` in Spark 2.0 provides builtin support for Hive features including the ability to
    +write queries using HiveQL, access to Hive UDFs, and the ability to read data from Hive tables.
    +To use these features, you do not need to have an existing Hive setup.
    +
    +## Creating DataFrames
    +
    +<div class="codetabs">
    +<div data-lang="scala"  markdown="1">
    +With a `SparkSession`, applications can create DataFrames from an [existing `RDD`](#interoperating-with-rdds),
    +from a Hive table, or from [Spark data sources](#data-sources).
    --- End diff --
    
    Sorry for missing that; I will check all inner links by searching for `<a href="# ` in the generated HTML.


---



[GitHub] spark issue #22746: [SPARK-24499][SQL][DOC] Split the page of sql-programmin...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22746
  
    Merged build finished. Test PASSed.


---



[GitHub] spark pull request #22746: [SPARK-24499][SQL][DOC] Split the page of sql-pro...

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22746#discussion_r226245945
  
    --- Diff: docs/sql-migration-guide-upgrade.md ---
    @@ -0,0 +1,520 @@
    +---
    +layout: global
    +title: Spark SQL Upgrading Guide
    +displayTitle: Spark SQL Upgrading Guide
    +---
    +
    +* Table of contents
    +{:toc}
    +
    +## Upgrading From Spark SQL 2.4 to 3.0
    +
    +  - In PySpark, when creating a `SparkSession` with `SparkSession.builder.getOrCreate()`, if there is an existing `SparkContext`, the builder was trying to update the `SparkConf` of the existing `SparkContext` with configurations specified to the builder, but the `SparkContext` is shared by all `SparkSession`s, so we should not update them. Since 3.0, the builder come to not update the configurations. This is the same behavior as Java/Scala API in 2.3 and above. If you want to update them, you need to update them prior to creating a `SparkSession`.
    --- End diff --
    
    `the builder come` -> `the builder comes`?
    cc @ueshin


---



[GitHub] spark pull request #22746: [SPARK-24499][SQL][DOC] Split the page of sql-pro...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22746#discussion_r226191366
  
    --- Diff: docs/sql-distributed-sql-engine.md ---
    @@ -0,0 +1,85 @@
    +---
    +layout: global
    +title: Distributed SQL Engine
    +displayTitle: Distributed SQL Engine
    +---
    +
    +* Table of contents
    +{:toc}
    +
    +Spark SQL can also act as a distributed query engine using its JDBC/ODBC or command-line interface.
    +In this mode, end-users or applications can interact with Spark SQL directly to run SQL queries,
    +without the need to write any code.
    +
    +## Running the Thrift JDBC/ODBC server
    +
    +The Thrift JDBC/ODBC server implemented here corresponds to the [`HiveServer2`](https://cwiki.apache.org/confluence/display/Hive/Setting+Up+HiveServer2)
    +in Hive 1.2.1 You can test the JDBC server with the beeline script that comes with either Spark or Hive 1.2.1.
    --- End diff --
    
    nit. `1.2.1 You` -> `1.2.1. You`
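
    For the Thrift JDBC/ODBC server described in this hunk, a minimal Scala sketch of a plain JDBC client (host, port, database, and credentials are hypothetical; the Hive JDBC driver must be on the classpath):

        import java.sql.DriverManager

        object ThriftServerClientSketch {
          def main(args: Array[String]): Unit = {
            // Register the Hive JDBC driver and connect to the Thrift server,
            // much as the beeline script does (the default port is 10000).
            Class.forName("org.apache.hive.jdbc.HiveDriver")
            val conn = DriverManager.getConnection(
              "jdbc:hive2://localhost:10000/default", "spark_user", "")
            try {
              val stmt = conn.createStatement()
              val rs = stmt.executeQuery("SHOW TABLES")
              while (rs.next()) {
                println(rs.getString(1))
              }
            } finally {
              conn.close()
            }
          }
        }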


---



[GitHub] spark pull request #22746: [SPARK-24499][SQL][DOC] Split the page of sql-pro...

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22746#discussion_r226231876
  
    --- Diff: docs/sql-data-sources-jdbc.md ---
    @@ -0,0 +1,223 @@
    +---
    +layout: global
    +title: JDBC To Other Databases
    +displayTitle: JDBC To Other Databases
    +---
    +
    +* Table of contents
    +{:toc}
    +
    +Spark SQL also includes a data source that can read data from other databases using JDBC. This
    +functionality should be preferred over using [JdbcRDD](api/scala/index.html#org.apache.spark.rdd.JdbcRDD).
    +This is because the results are returned
    +as a DataFrame and they can easily be processed in Spark SQL or joined with other data sources.
    +The JDBC data source is also easier to use from Java or Python as it does not require the user to
    +provide a ClassTag.
    +(Note that this is different than the Spark SQL JDBC server, which allows other applications to
    +run queries using Spark SQL).
    +
    +To get started you will need to include the JDBC driver for your particular database on the
    +spark classpath. For example, to connect to postgres from the Spark Shell you would run the
    +following command:
    +
    +{% highlight bash %}
    +bin/spark-shell --driver-class-path postgresql-9.4.1207.jar --jars postgresql-9.4.1207.jar
    +{% endhighlight %}
    +
    +Tables from the remote database can be loaded as a DataFrame or Spark SQL temporary view using
    +the Data Sources API. Users can specify the JDBC connection properties in the data source options.
    +<code>user</code> and <code>password</code> are normally provided as connection properties for
    +logging into the data sources. In addition to the connection properties, Spark also supports
    +the following case-insensitive options:
    +
    +<table class="table">
    +  <tr><th>Property Name</th><th>Meaning</th></tr>
    +  <tr>
    +    <td><code>url</code></td>
    +    <td>
    +      The JDBC URL to connect to. The source-specific connection properties may be specified in the URL. e.g., <code>jdbc:postgresql://localhost/test?user=fred&password=secret</code>
    +    </td>
    +  </tr>
    +
    +  <tr>
    +    <td><code>dbtable</code></td>
    +    <td>
    +      The JDBC table that should be read from or written into. Note that when using it in the read
    +      path anything that is valid in a <code>FROM</code> clause of a SQL query can be used.
    +      For example, instead of a full table you could also use a subquery in parentheses. It is not
    +      allowed to specify `dbtable` and `query` options at the same time.
    +    </td>
    +  </tr>
    +  <tr>
    +    <td><code>query</code></td>
    +    <td>
    +      A query that will be used to read data into Spark. The specified query will be parenthesized and used
    +      as a subquery in the <code>FROM</code> clause. Spark will also assign an alias to the subquery clause.
    +      As an example, spark will issue a query of the following form to the JDBC Source.<br><br>
    +      <code> SELECT &lt;columns&gt; FROM (&lt;user_specified_query&gt;) spark_gen_alias</code><br><br>
    +      Below are a couple of restrictions when using this option.<br>
    +      <ol>
    +         <li> It is not allowed to specify `dbtable` and `query` options at the same time. </li>
    +         <li> It is not allowed to spcify `query` and `partitionColumn` options at the same time. When specifying
    --- End diff --
    
    `spcify` -> `specify`
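
    To complement the option table in this hunk, a minimal Scala sketch of reading a table through the JDBC data source (the connection URL, table name, and credentials are hypothetical and reuse the postgres example above):

        import org.apache.spark.sql.SparkSession

        val spark = SparkSession.builder().appName("JdbcReadSketch").getOrCreate()

        // The postgresql driver jar must be on the driver and executor classpath,
        // e.g. via --driver-class-path and --jars as shown in the hunk above.
        val jdbcDF = spark.read
          .format("jdbc")
          .option("url", "jdbc:postgresql://localhost/test")
          .option("dbtable", "public.people")   // alternatively use the `query` option
          .option("user", "fred")
          .option("password", "secret")
          .load()

        jdbcDF.show()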


---



[GitHub] spark pull request #22746: [SPARK-24499][SQL][DOC] Split the page of sql-pro...

Posted by xuanyuanking <gi...@git.apache.org>.
Github user xuanyuanking commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22746#discussion_r226208608
  
    --- Diff: docs/sql-distributed-sql-engine.md ---
    @@ -0,0 +1,85 @@
    +---
    +layout: global
    +title: Distributed SQL Engine
    +displayTitle: Distributed SQL Engine
    +---
    +
    +* Table of contents
    +{:toc}
    +
    +Spark SQL can also act as a distributed query engine using its JDBC/ODBC or command-line interface.
    +In this mode, end-users or applications can interact with Spark SQL directly to run SQL queries,
    +without the need to write any code.
    +
    +## Running the Thrift JDBC/ODBC server
    +
    +The Thrift JDBC/ODBC server implemented here corresponds to the [`HiveServer2`](https://cwiki.apache.org/confluence/display/Hive/Setting+Up+HiveServer2)
    +in Hive 1.2.1 You can test the JDBC server with the beeline script that comes with either Spark or Hive 1.2.1.
    --- End diff --
    
    Thanks, done in 27b066d.


---



[GitHub] spark pull request #22746: [SPARK-24499][SQL][DOC] Split the page of sql-pro...

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22746#discussion_r226237047
  
    --- Diff: docs/sql-data-sources-parquet.md ---
    @@ -0,0 +1,321 @@
    +---
    +layout: global
    +title: Parquet Files
    +displayTitle: Parquet Files
    +---
    +
    +* Table of contents
    +{:toc}
    +
    +[Parquet](http://parquet.io) is a columnar format that is supported by many other data processing systems.
    +Spark SQL provides support for both reading and writing Parquet files that automatically preserves the schema
    +of the original data. When writing Parquet files, all columns are automatically converted to be nullable for
    +compatibility reasons.
    +
    +### Loading Data Programmatically
    +
    +Using the data from the above example:
    +
    +<div class="codetabs">
    +
    +<div data-lang="scala"  markdown="1">
    +{% include_example basic_parquet_example scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %}
    +</div>
    +
    +<div data-lang="java"  markdown="1">
    +{% include_example basic_parquet_example java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %}
    +</div>
    +
    +<div data-lang="python"  markdown="1">
    +
    +{% include_example basic_parquet_example python/sql/datasource.py %}
    +</div>
    +
    +<div data-lang="r"  markdown="1">
    +
    +{% include_example basic_parquet_example r/RSparkSQLExample.R %}
    +
    +</div>
    +
    +<div data-lang="sql"  markdown="1">
    +
    +{% highlight sql %}
    +
    +CREATE TEMPORARY VIEW parquetTable
    +USING org.apache.spark.sql.parquet
    +OPTIONS (
    +  path "examples/src/main/resources/people.parquet"
    +)
    +
    +SELECT * FROM parquetTable
    +
    +{% endhighlight %}
    +
    +</div>
    +
    +</div>
    +
    +### Partition Discovery
    +
    +Table partitioning is a common optimization approach used in systems like Hive. In a partitioned
    +table, data are usually stored in different directories, with partitioning column values encoded in
    +the path of each partition directory. All built-in file sources (including Text/CSV/JSON/ORC/Parquet)
    +are able to discover and infer partitioning information automatically.
    +For example, we can store all our previously used
    +population data into a partitioned table using the following directory structure, with two extra
    +columns, `gender` and `country` as partitioning columns:
    +
    +{% highlight text %}
    +
    +path
    +└── to
    +    └── table
    +        ├── gender=male
    +        │   ├── ...
    +        │   │
    +        │   ├── country=US
    +        │   │   └── data.parquet
    +        │   ├── country=CN
    +        │   │   └── data.parquet
    +        │   └── ...
    +        └── gender=female
    +            ├── ...
    +            │
    +            ├── country=US
    +            │   └── data.parquet
    +            ├── country=CN
    +            │   └── data.parquet
    +            └── ...
    +
    +{% endhighlight %}
    +
    +By passing `path/to/table` to either `SparkSession.read.parquet` or `SparkSession.read.load`, Spark SQL
    +will automatically extract the partitioning information from the paths.
    +Now the schema of the returned DataFrame becomes:
    +
    +{% highlight text %}
    +
    +root
    +|-- name: string (nullable = true)
    +|-- age: long (nullable = true)
    +|-- gender: string (nullable = true)
    +|-- country: string (nullable = true)
    +
    +{% endhighlight %}
    +
    +Notice that the data types of the partitioning columns are automatically inferred. Currently,
    +numeric data types, date, timestamp and string type are supported. Sometimes users may not want
    +to automatically infer the data types of the partitioning columns. For these use cases, the
    +automatic type inference can be configured by
    +`spark.sql.sources.partitionColumnTypeInference.enabled`, which defaults to `true`. When type
    +inference is disabled, string type will be used for the partitioning columns.
    +
    +Starting from Spark 1.6.0, partition discovery only finds partitions under the given paths
    +by default. For the above example, if users pass `path/to/table/gender=male` to either
    +`SparkSession.read.parquet` or `SparkSession.read.load`, `gender` will not be considered as a
    +partitioning column. If users need to specify the base path that partition discovery
    +should start with, they can set `basePath` in the data source options. For example,
    +when `path/to/table/gender=male` is the path of the data and
    +users set `basePath` to `path/to/table/`, `gender` will be a partitioning column.
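    +
    +A minimal Scala sketch of the reads described above, assuming an active `SparkSession` named
    +`spark`; the DataFrame names are illustrative only:
    +
    +{% highlight scala %}
    +// Reading the table root lets Spark discover `gender` and `country` as partition columns.
    +val allRowsDF = spark.read.parquet("path/to/table")
    +
    +// Reading a single partition directory would drop `gender`, unless `basePath`
    +// points back at the table root.
    +val maleDF = spark.read
    +  .option("basePath", "path/to/table/")
    +  .parquet("path/to/table/gender=male")
    +{% endhighlight %}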
    +
    +### Schema Merging
    +
    +Like ProtocolBuffer, Avro, and Thrift, Parquet also supports schema evolution. Users can start with
    --- End diff --
    
    `ProtocolBuffer` -> `Protocol Buffers`


---



[GitHub] spark pull request #22746: [SPARK-24499][SQL][DOC] Split the page of sql-pro...

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22746#discussion_r226246375
  
    --- Diff: docs/sql-migration-guide-upgrade.md ---
    @@ -0,0 +1,520 @@
    +---
    +layout: global
    +title: Spark SQL Upgrading Guide
    +displayTitle: Spark SQL Upgrading Guide
    +---
    +
    +* Table of contents
    +{:toc}
    +
    +## Upgrading From Spark SQL 2.4 to 3.0
    +
    +  - In PySpark, when creating a `SparkSession` with `SparkSession.builder.getOrCreate()`, if there is an existing `SparkContext`, the builder was trying to update the `SparkConf` of the existing `SparkContext` with configurations specified to the builder, but the `SparkContext` is shared by all `SparkSession`s, so we should not update them. Since 3.0, the builder come to not update the configurations. This is the same behavior as Java/Scala API in 2.3 and above. If you want to update them, you need to update them prior to creating a `SparkSession`.
    +
    +## Upgrading From Spark SQL 2.3 to 2.4
    +
    +  - In Spark version 2.3 and earlier, the second parameter to array_contains function is implicitly promoted to the element type of first array type parameter. This type promotion can be lossy and may cause `array_contains` function to return wrong result. This problem has been addressed in 2.4 by employing a safer type promotion mechanism. This can cause some change in behavior and are illustrated in the table below.
    +  <table class="table">
    +        <tr>
    +          <th>
    +            <b>Query</b>
    +          </th>
    +          <th>
    +            <b>Result Spark 2.3 or Prior</b>
    +          </th>
    +          <th>
    +            <b>Result Spark 2.4</b>
    +          </th>
    +          <th>
    +            <b>Remarks</b>
    +          </th>
    +        </tr>
    +        <tr>
    +          <th>
    +            <b>SELECT <br> array_contains(array(1), 1.34D);</b>
    +          </th>
    +          <th>
    +            <b>true</b>
    +          </th>
    +          <th>
    +            <b>false</b>
    +          </th>
    +          <th>
    +            <b>In Spark 2.4, left and right parameters are  promoted to array(double) and double type respectively.</b>
    +          </th>
    +        </tr>
    +        <tr>
    +          <th>
    +            <b>SELECT <br> array_contains(array(1), '1');</b>
    +          </th>
    +          <th>
    +            <b>true</b>
    +          </th>
    +          <th>
    +            <b>AnalysisException is thrown since integer type can not be promoted to string type in a loss-less manner.</b>
    +          </th>
    +          <th>
    +            <b>Users can use explict cast</b>
    +          </th>
    +        </tr>
    +        <tr>
    +          <th>
    +            <b>SELECT <br> array_contains(array(1), 'anystring');</b>
    +          </th>
    +          <th>
    +            <b>null</b>
    +          </th>
    +          <th>
    +            <b>AnalysisException is thrown since integer type can not be promoted to string type in a loss-less manner.</b>
    +          </th>
    +          <th>
    +            <b>Users can use explict cast</b>
    +          </th>
    +        </tr>
    +  </table>
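    +
    +  A minimal Scala sketch of the explicit cast mentioned in the table above, assuming an active `SparkSession` named `spark`:
    +
    +  {% highlight scala %}
    +  // Casting the second argument to the array's element type avoids the AnalysisException
    +  // raised in Spark 2.4 by the lossy string-to-integer promotion.
    +  spark.sql("SELECT array_contains(array(1), CAST('1' AS INT))").show()
    +  {% endhighlight %}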
    +
    +  - Since Spark 2.4, when there is a struct field in front of the IN operator before a subquery, the inner query must contain a struct field as well. In previous versions, instead, the fields of the struct were compared to the output of the inner query. Eg. if `a` is a `struct(a string, b int)`, in Spark 2.4 `a in (select (1 as a, 'a' as b) from range(1))` is a valid query, while `a in (select 1, 'a' from range(1))` is not. In previous version it was the opposite.
    +  - In versions 2.2.1+ and 2.3, if `spark.sql.caseSensitive` is set to true, then the `CURRENT_DATE` and `CURRENT_TIMESTAMP` functions incorrectly became case-sensitive and would resolve to columns (unless typed in lower case). In Spark 2.4 this has been fixed and the functions are no longer case-sensitive.
    +  - Since Spark 2.4, Spark will evaluate the set operations referenced in a query by following a precedence rule as per the SQL standard. If the order is not specified by parentheses, set operations are performed from left to right with the exception that all INTERSECT operations are performed before any UNION, EXCEPT or MINUS operations. The old behaviour of giving equal precedence to all the set operations are preserved under a newly added configuration `spark.sql.legacy.setopsPrecedence.enabled` with a default value of `false`. When this property is set to `true`, spark will evaluate the set operators from left to right as they appear in the query given no explicit ordering is enforced by usage of parenthesis.
    +  - Since Spark 2.4, Spark will display table description column Last Access value as UNKNOWN when the value was Jan 01 1970.
    +  - Since Spark 2.4, Spark maximizes the usage of a vectorized ORC reader for ORC files by default. To do that, `spark.sql.orc.impl` and `spark.sql.orc.filterPushdown` change their default values to `native` and `true` respectively.
    +  - In PySpark, when Arrow optimization is enabled, previously `toPandas` just failed when Arrow optimization is unable to be used whereas `createDataFrame` from Pandas DataFrame allowed the fallback to non-optimization. Now, both `toPandas` and `createDataFrame` from Pandas DataFrame allow the fallback by default, which can be switched off by `spark.sql.execution.arrow.fallback.enabled`.
    +  - Since Spark 2.4, writing an empty dataframe to a directory launches at least one write task, even if physically the dataframe has no partition. This introduces a small behavior change that for self-describing file formats like Parquet and Orc, Spark creates a metadata-only file in the target directory when writing a 0-partition dataframe, so that schema inference can still work if users read that directory later. The new behavior is more reasonable and more consistent regarding writing empty dataframe.
    +  - Since Spark 2.4, expression IDs in UDF arguments do not appear in column names. For example, an column name in Spark 2.4 is not `UDF:f(col0 AS colA#28)` but ``UDF:f(col0 AS `colA`)``.
    --- End diff --
    
    `an column` -> `a column`


---



[GitHub] spark issue #22746: [SPARK-24499][SQL][DOC] Split the page of sql-programmin...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/22746
  
    We might have missed something in the code review. Let us play with the new doc and see whether we missed anything.



---



[GitHub] spark pull request #22746: [SPARK-24499][SQL][DOC] Split the page of sql-pro...

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22746#discussion_r226263066
  
    --- Diff: docs/sql-migration-guide-upgrade.md ---
    @@ -0,0 +1,520 @@
    +---
    +layout: global
    +title: Spark SQL Upgrading Guide
    +displayTitle: Spark SQL Upgrading Guide
    +---
    +
    +* Table of contents
    +{:toc}
    +
    +## Upgrading From Spark SQL 2.4 to 3.0
    +
    +  - In PySpark, when creating a `SparkSession` with `SparkSession.builder.getOrCreate()`, if there is an existing `SparkContext`, the builder was trying to update the `SparkConf` of the existing `SparkContext` with configurations specified to the builder, but the `SparkContext` is shared by all `SparkSession`s, so we should not update them. Since 3.0, the builder come to not update the configurations. This is the same behavior as Java/Scala API in 2.3 and above. If you want to update them, you need to update them prior to creating a `SparkSession`.
    +
    +## Upgrading From Spark SQL 2.3 to 2.4
    +
    +  - In Spark version 2.3 and earlier, the second parameter to array_contains function is implicitly promoted to the element type of first array type parameter. This type promotion can be lossy and may cause `array_contains` function to return wrong result. This problem has been addressed in 2.4 by employing a safer type promotion mechanism. This can cause some change in behavior and are illustrated in the table below.
    +  <table class="table">
    +        <tr>
    +          <th>
    +            <b>Query</b>
    +          </th>
    +          <th>
    +            <b>Result Spark 2.3 or Prior</b>
    +          </th>
    +          <th>
    +            <b>Result Spark 2.4</b>
    +          </th>
    +          <th>
    +            <b>Remarks</b>
    +          </th>
    +        </tr>
    +        <tr>
    +          <th>
    +            <b>SELECT <br> array_contains(array(1), 1.34D);</b>
    +          </th>
    +          <th>
    +            <b>true</b>
    +          </th>
    +          <th>
    +            <b>false</b>
    +          </th>
    +          <th>
    +            <b>In Spark 2.4, left and right parameters are  promoted to array(double) and double type respectively.</b>
    +          </th>
    +        </tr>
    +        <tr>
    +          <th>
    +            <b>SELECT <br> array_contains(array(1), '1');</b>
    +          </th>
    +          <th>
    +            <b>true</b>
    +          </th>
    +          <th>
    +            <b>AnalysisException is thrown since integer type can not be promoted to string type in a loss-less manner.</b>
    +          </th>
    +          <th>
    +            <b>Users can use explict cast</b>
    +          </th>
    +        </tr>
    +        <tr>
    +          <th>
    +            <b>SELECT <br> array_contains(array(1), 'anystring');</b>
    +          </th>
    +          <th>
    +            <b>null</b>
    +          </th>
    +          <th>
    +            <b>AnalysisException is thrown since integer type can not be promoted to string type in a loss-less manner.</b>
    +          </th>
    +          <th>
    +            <b>Users can use explict cast</b>
    --- End diff --
    
    `explict` -> `explicit`


---



[GitHub] spark issue #22746: [SPARK-24499][SQL][DOC] Split the page of sql-programmin...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22746
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/4054/
    Test PASSed.


---



[GitHub] spark issue #22746: [SPARK-24499][SQL][DOC] Split the page of sql-programmin...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/22746
  
    Thanks! Merged to master/2.4. For the 2.4 branch, I manually removed the 2.4-to-3.0 migration guide.


---



[GitHub] spark issue #22746: [SPARK-24499][SQL][DOC] Split the page of sql-programmin...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22746
  
    **[Test build #97453 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97453/testReport)** for PR 22746 at commit [`c2ad4a3`](https://github.com/apache/spark/commit/c2ad4a3420db08a4c8dbe5c3bbfb9938e3c73fff).


---



[GitHub] spark pull request #22746: [SPARK-24499][SQL][DOC] Split the page of sql-pro...

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22746#discussion_r226262995
  
    --- Diff: docs/sql-migration-guide-upgrade.md ---
    @@ -0,0 +1,520 @@
    +---
    +layout: global
    +title: Spark SQL Upgrading Guide
    +displayTitle: Spark SQL Upgrading Guide
    +---
    +
    +* Table of contents
    +{:toc}
    +
    +## Upgrading From Spark SQL 2.4 to 3.0
    +
    +  - In PySpark, when creating a `SparkSession` with `SparkSession.builder.getOrCreate()`, if there is an existing `SparkContext`, the builder was trying to update the `SparkConf` of the existing `SparkContext` with configurations specified to the builder, but the `SparkContext` is shared by all `SparkSession`s, so we should not update them. Since 3.0, the builder come to not update the configurations. This is the same behavior as Java/Scala API in 2.3 and above. If you want to update them, you need to update them prior to creating a `SparkSession`.
    +
    +## Upgrading From Spark SQL 2.3 to 2.4
    +
    +  - In Spark version 2.3 and earlier, the second parameter to array_contains function is implicitly promoted to the element type of first array type parameter. This type promotion can be lossy and may cause `array_contains` function to return wrong result. This problem has been addressed in 2.4 by employing a safer type promotion mechanism. This can cause some change in behavior and are illustrated in the table below.
    +  <table class="table">
    +        <tr>
    +          <th>
    +            <b>Query</b>
    +          </th>
    +          <th>
    +            <b>Result Spark 2.3 or Prior</b>
    +          </th>
    +          <th>
    +            <b>Result Spark 2.4</b>
    +          </th>
    +          <th>
    +            <b>Remarks</b>
    +          </th>
    +        </tr>
    +        <tr>
    +          <th>
    +            <b>SELECT <br> array_contains(array(1), 1.34D);</b>
    +          </th>
    +          <th>
    +            <b>true</b>
    +          </th>
    +          <th>
    +            <b>false</b>
    +          </th>
    +          <th>
    +            <b>In Spark 2.4, left and right parameters are  promoted to array(double) and double type respectively.</b>
    +          </th>
    +        </tr>
    +        <tr>
    +          <th>
    +            <b>SELECT <br> array_contains(array(1), '1');</b>
    +          </th>
    +          <th>
    +            <b>true</b>
    +          </th>
    +          <th>
    +            <b>AnalysisException is thrown since integer type can not be promoted to string type in a loss-less manner.</b>
    +          </th>
    +          <th>
    +            <b>Users can use explict cast</b>
    --- End diff --
    
    `explict` -> `explicit`


---



[GitHub] spark pull request #22746: [SPARK-24499][SQL][DOC] Split the page of sql-pro...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22746#discussion_r226189929
  
    --- Diff: docs/_data/menu-sql.yaml ---
    @@ -0,0 +1,81 @@
    +- text: Getting Started
    +  url: sql-getting-started.html
    +  subitems:
    +    - text: "Starting Point: SparkSession"
    +      url: sql-getting-started.html#starting-point-sparksession
    +    - text: Creating DataFrames
    +      url: sql-getting-started.html#creating-dataframes
    +    - text: Untyped Dataset Operations (DataFrame operations)
    +      url: sql-getting-started.html#untyped-dataset-operations-aka-dataframe-operations
    +    - text: Running SQL Queries Programmatically
    +      url: sql-getting-started.html#running-sql-queries-programmatically
    +    - text: Global Temporary View
    +      url: sql-getting-started.html#global-temporary-view
    +    - text: Creating Datasets
    +      url: sql-getting-started.html#creating-datasets
    +    - text: Interoperating with RDDs
    +      url: sql-getting-started.html#interoperating-with-rdds
    +    - text: Aggregations
    +      url: sql-getting-started.html#aggregations
    +- text: Data Sources
    +  url: sql-data-sources.html
    +  subitems:
    +    - text: "Generic Load/Save Functions"
    +      url: sql-data-sources-load-save-functions.html
    +    - text: Parquet Files
    +      url: sql-data-sources-parquet.html
    +    - text: ORC Files
    +      url: sql-data-sources-other.html#orc-files
    +    - text: JSON Datasets
    +      url: sql-data-sources-other.html#json-datasets
    +    - text: Hive Tables
    +      url: sql-data-sources-hive-tables.html
    +    - text: JDBC To Other Databases
    +      url: sql-data-sources-jdbc.html
    +    - text: Avro Files
    +      url: sql-data-sources-avro.html
    +    - text: Troubleshooting
    +      url: sql-data-sources-other.html#troubleshooting
    --- End diff --
    
    Hi, @xuanyuanking . Generally, it looks good.
    
    Can we split `sql-data-sources-other` into three files? For me, `troubleshooting` looks weird in terms of level of information. Actually, `sql-data-sources-other` covers only two data sources plus `troubleshooting` for JDBC.
    
    Maybe `sql-data-sources-orc`, `sql-data-sources-json` and `troubleshooting`?


---



[GitHub] spark issue #22746: [SPARK-24499][SQL][DOC] Split the page of sql-programmin...

Posted by xuanyuanking <gi...@git.apache.org>.
Github user xuanyuanking commented on the issue:

    https://github.com/apache/spark/pull/22746
  
    @gatorsmile Sorry for the delay on this; please have a look when you have time.


---



[GitHub] spark issue #22746: [SPARK-24499][SQL][DOC] Split the page of sql-programmin...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22746
  
    **[Test build #97519 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97519/testReport)** for PR 22746 at commit [`27b066d`](https://github.com/apache/spark/commit/27b066d7635bf2d7a04c869468b3ea9273f75ef6).


---



[GitHub] spark issue #22746: [SPARK-24499][SQL][DOC] Split the page of sql-programmin...

Posted by xuanyuanking <gi...@git.apache.org>.
Github user xuanyuanking commented on the issue:

    https://github.com/apache/spark/pull/22746
  
    Thanks to all reviewers! Sorry there are still some mistakes in the new doc; I'll keep checking on this.


---



[GitHub] spark issue #22746: [SPARK-24499][SQL][DOC] Split the page of sql-programmin...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22746
  
    **[Test build #97500 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97500/testReport)** for PR 22746 at commit [`b3fc39d`](https://github.com/apache/spark/commit/b3fc39d005e985b4ec769e10a4221c5b4d0591b4).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark pull request #22746: [SPARK-24499][SQL][DOC] Split the page of sql-pro...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22746#discussion_r226190219
  
    --- Diff: docs/sql-data-sources-other.md ---
    @@ -0,0 +1,114 @@
    +---
    +layout: global
    +title: Other Data Sources
    +displayTitle: Other Data Sources
    +---
    +
    +* Table of contents
    +{:toc}
    +
    +## ORC Files
    +
    +Since Spark 2.3, Spark supports a vectorized ORC reader with a new ORC file format for ORC files.
    +To do that, the following configurations are newly added. The vectorized reader is used for the
    +native ORC tables (e.g., the ones created using the clause `USING ORC`) when `spark.sql.orc.impl`
    +is set to `native` and `spark.sql.orc.enableVectorizedReader` is set to `true`. For the Hive ORC
    +serde tables (e.g., the ones created using the clause `USING HIVE OPTIONS (fileFormat 'ORC')`),
    +the vectorized reader is used when `spark.sql.hive.convertMetastoreOrc` is also set to `true`.
    +
    +<table class="table">
    +  <tr><th><b>Property Name</b></th><th><b>Default</b></th><th><b>Meaning</b></th></tr>
    +  <tr>
    +    <td><code>spark.sql.orc.impl</code></td>
    +    <td><code>native</code></td>
    +    <td>The name of ORC implementation. It can be one of <code>native</code> and <code>hive</code>. <code>native</code> means the native ORC support that is built on Apache ORC 1.4. `hive` means the ORC library in Hive 1.2.1.</td>
    +  </tr>
    +  <tr>
    +    <td><code>spark.sql.orc.enableVectorizedReader</code></td>
    +    <td><code>true</code></td>
    +    <td>Enables vectorized orc decoding in <code>native</code> implementation. If <code>false</code>, a new non-vectorized ORC reader is used in <code>native</code> implementation. For <code>hive</code> implementation, this is ignored.</td>
    +  </tr>
    +</table>
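    +
    +A minimal Scala sketch of these options in use, assuming an active `SparkSession` named `spark`
    +and an ORC file at an illustrative path:
    +
    +{% highlight scala %}
    +// The defaults already select the native, vectorized reader; setting them
    +// explicitly here only makes the choice visible.
    +spark.conf.set("spark.sql.orc.impl", "native")
    +spark.conf.set("spark.sql.orc.enableVectorizedReader", "true")
    +
    +val usersDF = spark.read.orc("examples/src/main/resources/users.orc")
    +{% endhighlight %}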
    +
    +## JSON Datasets
    --- End diff --
    
    For consistency with the other data sources, `Datasets` -> `Files`?


---



[GitHub] spark pull request #22746: [SPARK-24499][SQL][DOC] Split the page of sql-pro...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/22746


---



[GitHub] spark issue #22746: [SPARK-24499][SQL][DOC] Split the page of sql-programmin...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22746
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/4076/
    Test PASSed.


---



[GitHub] spark pull request #22746: [SPARK-24499][SQL][DOC] Split the page of sql-pro...

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22746#discussion_r226250306
  
    --- Diff: docs/sql-performance-turing.md ---
    @@ -0,0 +1,151 @@
    +---
    +layout: global
    +title: Performance Tuning
    +displayTitle: Performance Tuning
    +---
    +
    +* Table of contents
    +{:toc}
    +
    +For some workloads, it is possible to improve performance by either caching data in memory, or by
    +turning on some experimental options.
    +
    +## Caching Data In Memory
    +
    +Spark SQL can cache tables using an in-memory columnar format by calling `spark.catalog.cacheTable("tableName")` or `dataFrame.cache()`.
    +Then Spark SQL will scan only required columns and will automatically tune compression to minimize
    +memory usage and GC pressure. You can call `spark.catalog.uncacheTable("tableName")` to remove the table from memory.
    +
    +Configuration of in-memory caching can be done using the `setConf` method on `SparkSession` or by running
    +`SET key=value` commands using SQL.
    +
    +<table class="table">
    +<tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr>
    +<tr>
    +  <td><code>spark.sql.inMemoryColumnarStorage.compressed</code></td>
    +  <td>true</td>
    +  <td>
    +    When set to true Spark SQL will automatically select a compression codec for each column based
    +    on statistics of the data.
    +  </td>
    +</tr>
    +<tr>
    +  <td><code>spark.sql.inMemoryColumnarStorage.batchSize</code></td>
    +  <td>10000</td>
    +  <td>
    +    Controls the size of batches for columnar caching. Larger batch sizes can improve memory utilization
    +    and compression, but risk OOMs when caching data.
    +  </td>
    +</tr>
    +
    +</table>
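    +
    +A minimal Scala sketch of the calls described above, assuming an active `SparkSession` named
    +`spark` and a temporary view named `people`:
    +
    +{% highlight scala %}
    +// Cache the view in the in-memory columnar format, query it, then release it.
    +spark.catalog.cacheTable("people")
    +spark.sql("SELECT COUNT(*) FROM people").show()
    +spark.catalog.uncacheTable("people")
    +
    +// The options above can also be set per session with a SQL SET command.
    +spark.sql("SET spark.sql.inMemoryColumnarStorage.batchSize=20000")
    +{% endhighlight %}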
    +
    +## Other Configuration Options
    +
    +The following options can also be used to tune the performance of query execution. It is possible
    +that these options will be deprecated in future release as more optimizations are performed automatically.
    +
    +<table class="table">
    +  <tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr>
    +  <tr>
    +    <td><code>spark.sql.files.maxPartitionBytes</code></td>
    +    <td>134217728 (128 MB)</td>
    +    <td>
    +      The maximum number of bytes to pack into a single partition when reading files.
    +    </td>
    +  </tr>
    +  <tr>
    +    <td><code>spark.sql.files.openCostInBytes</code></td>
    +    <td>4194304 (4 MB)</td>
    +    <td>
    +      The estimated cost to open a file, measured by the number of bytes could be scanned in the same
    +      time. This is used when putting multiple files into a partition. It is better to over estimated,
    --- End diff --
    
    nit: `It is better to over estimated` -> `It is better to over-estimate`?


---



[GitHub] spark issue #22746: [SPARK-24499][SQL][DOC] Split the page of sql-programmin...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22746
  
    **[Test build #97482 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97482/testReport)** for PR 22746 at commit [`58115e5`](https://github.com/apache/spark/commit/58115e5a69670f45cf05d2026cb57abb595fe073).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #22746: [SPARK-24499][SQL][DOC] Split the page of sql-programmin...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22746
  
    Merged build finished. Test PASSed.


---



[GitHub] spark pull request #22746: [SPARK-24499][SQL][DOC] Split the page of sql-pro...

Posted by xuanyuanking <gi...@git.apache.org>.
Github user xuanyuanking commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22746#discussion_r225794532
  
    --- Diff: docs/sql-reference.md ---
    @@ -0,0 +1,641 @@
    +---
    +layout: global
    +title: Reference
    +displayTitle: Reference
    +---
    +
    +* Table of contents
    +{:toc}
    +
    +## Data Types
    +
    +Spark SQL and DataFrames support the following data types:
    +
    +* Numeric types
    +    - `ByteType`: Represents 1-byte signed integer numbers.
    --- End diff --
    
    Thanks, done in 58115e5.


---



[GitHub] spark pull request #22746: [SPARK-24499][SQL][DOC] Split the page of sql-pro...

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22746#discussion_r226247607
  
    --- Diff: docs/sql-migration-guide-upgrade.md ---
    @@ -0,0 +1,520 @@
    +---
    +layout: global
    +title: Spark SQL Upgrading Guide
    +displayTitle: Spark SQL Upgrading Guide
    +---
    +
    +* Table of contents
    +{:toc}
    +
    +## Upgrading From Spark SQL 2.4 to 3.0
    +
    +  - In PySpark, when creating a `SparkSession` with `SparkSession.builder.getOrCreate()`, if there is an existing `SparkContext`, the builder was trying to update the `SparkConf` of the existing `SparkContext` with configurations specified to the builder, but the `SparkContext` is shared by all `SparkSession`s, so we should not update them. Since 3.0, the builder come to not update the configurations. This is the same behavior as Java/Scala API in 2.3 and above. If you want to update them, you need to update them prior to creating a `SparkSession`.
    +
    +## Upgrading From Spark SQL 2.3 to 2.4
    +
    +  - In Spark version 2.3 and earlier, the second parameter to array_contains function is implicitly promoted to the element type of first array type parameter. This type promotion can be lossy and may cause `array_contains` function to return wrong result. This problem has been addressed in 2.4 by employing a safer type promotion mechanism. This can cause some change in behavior and are illustrated in the table below.
    +  <table class="table">
    +        <tr>
    +          <th>
    +            <b>Query</b>
    +          </th>
    +          <th>
    +            <b>Result Spark 2.3 or Prior</b>
    +          </th>
    +          <th>
    +            <b>Result Spark 2.4</b>
    +          </th>
    +          <th>
    +            <b>Remarks</b>
    +          </th>
    +        </tr>
    +        <tr>
    +          <th>
    +            <b>SELECT <br> array_contains(array(1), 1.34D);</b>
    +          </th>
    +          <th>
    +            <b>true</b>
    +          </th>
    +          <th>
    +            <b>false</b>
    +          </th>
    +          <th>
    +            <b>In Spark 2.4, left and right parameters are  promoted to array(double) and double type respectively.</b>
    +          </th>
    +        </tr>
    +        <tr>
    +          <th>
    +            <b>SELECT <br> array_contains(array(1), '1');</b>
    +          </th>
    +          <th>
    +            <b>true</b>
    +          </th>
    +          <th>
    +            <b>AnalysisException is thrown since integer type can not be promoted to string type in a loss-less manner.</b>
    +          </th>
    +          <th>
    +            <b>Users can use explict cast</b>
    +          </th>
    +        </tr>
    +        <tr>
    +          <th>
    +            <b>SELECT <br> array_contains(array(1), 'anystring');</b>
    +          </th>
    +          <th>
    +            <b>null</b>
    +          </th>
    +          <th>
    +            <b>AnalysisException is thrown since integer type can not be promoted to string type in a loss-less manner.</b>
    +          </th>
    +          <th>
    +            <b>Users can use explict cast</b>
    +          </th>
    +        </tr>
    +  </table>
    +
    +  - Since Spark 2.4, when there is a struct field in front of the IN operator before a subquery, the inner query must contain a struct field as well. In previous versions, instead, the fields of the struct were compared to the output of the inner query. Eg. if `a` is a `struct(a string, b int)`, in Spark 2.4 `a in (select (1 as a, 'a' as b) from range(1))` is a valid query, while `a in (select 1, 'a' from range(1))` is not. In previous version it was the opposite.
    +  - In versions 2.2.1+ and 2.3, if `spark.sql.caseSensitive` is set to true, then the `CURRENT_DATE` and `CURRENT_TIMESTAMP` functions incorrectly became case-sensitive and would resolve to columns (unless typed in lower case). In Spark 2.4 this has been fixed and the functions are no longer case-sensitive.
    +  - Since Spark 2.4, Spark will evaluate the set operations referenced in a query by following a precedence rule as per the SQL standard. If the order is not specified by parentheses, set operations are performed from left to right with the exception that all INTERSECT operations are performed before any UNION, EXCEPT or MINUS operations. The old behaviour of giving equal precedence to all the set operations are preserved under a newly added configuration `spark.sql.legacy.setopsPrecedence.enabled` with a default value of `false`. When this property is set to `true`, spark will evaluate the set operators from left to right as they appear in the query given no explicit ordering is enforced by usage of parenthesis.
    +  - Since Spark 2.4, Spark will display table description column Last Access value as UNKNOWN when the value was Jan 01 1970.
    +  - Since Spark 2.4, Spark maximizes the usage of a vectorized ORC reader for ORC files by default. To do that, `spark.sql.orc.impl` and `spark.sql.orc.filterPushdown` change their default values to `native` and `true` respectively.
    +  - In PySpark, when Arrow optimization is enabled, previously `toPandas` just failed when Arrow optimization is unable to be used whereas `createDataFrame` from Pandas DataFrame allowed the fallback to non-optimization. Now, both `toPandas` and `createDataFrame` from Pandas DataFrame allow the fallback by default, which can be switched off by `spark.sql.execution.arrow.fallback.enabled`.
    +  - Since Spark 2.4, writing an empty dataframe to a directory launches at least one write task, even if physically the dataframe has no partition. This introduces a small behavior change that for self-describing file formats like Parquet and Orc, Spark creates a metadata-only file in the target directory when writing a 0-partition dataframe, so that schema inference can still work if users read that directory later. The new behavior is more reasonable and more consistent regarding writing empty dataframe.
    +  - Since Spark 2.4, expression IDs in UDF arguments do not appear in column names. For example, an column name in Spark 2.4 is not `UDF:f(col0 AS colA#28)` but ``UDF:f(col0 AS `colA`)``.
    +  - Since Spark 2.4, writing a dataframe with an empty or nested empty schema using any file formats (parquet, orc, json, text, csv etc.) is not allowed. An exception is thrown when attempting to write dataframes with empty schema.
    +  - Since Spark 2.4, Spark compares a DATE type with a TIMESTAMP type after promoting both sides to TIMESTAMP. Setting `spark.sql.legacy.compareDateTimestampInTimestamp` to `false` restores the previous behavior. This option will be removed in Spark 3.0.
    +  - Since Spark 2.4, creating a managed table with a nonempty location is not allowed. An exception is thrown when attempting to create a managed table with a nonempty location. Setting `spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation` to `true` restores the previous behavior. This option will be removed in Spark 3.0.
    +  - Since Spark 2.4, renaming a managed table to an existing location is not allowed. An exception is thrown when attempting to rename a managed table to an existing location.
    +  - Since Spark 2.4, the type coercion rules can automatically promote the argument types of the variadic SQL functions (e.g., IN/COALESCE) to the widest common type, regardless of the order of the input arguments. In prior Spark versions, the promotion could fail in some specific orders (e.g., TimestampType, IntegerType and StringType) and throw an exception.
    +  - Since Spark 2.4, Spark has enabled non-cascading SQL cache invalidation in addition to the traditional cache invalidation mechanism. The non-cascading cache invalidation mechanism allows users to remove a cache without impacting its dependent caches. This new cache invalidation mechanism is used in scenarios where the data of the cache to be removed is still valid, e.g., calling unpersist() on a Dataset, or dropping a temporary view. This allows users to free up memory and keep the desired caches valid at the same time.
    +  - In version 2.3 and earlier, Spark converts Parquet Hive tables by default but ignores table properties like `TBLPROPERTIES (parquet.compression 'NONE')`. This happens for ORC Hive table properties like `TBLPROPERTIES (orc.compress 'NONE')` in case of `spark.sql.hive.convertMetastoreOrc=true`, too. Since Spark 2.4, Spark respects Parquet/ORC specific table properties while converting Parquet/ORC Hive tables. As an example, `CREATE TABLE t(id int) STORED AS PARQUET TBLPROPERTIES (parquet.compression 'NONE')` would generate Snappy parquet files during insertion in Spark 2.3, and in Spark 2.4, the result would be uncompressed parquet files.
    +  - Since Spark 2.0, Spark converts Parquet Hive tables by default for better performance. Since Spark 2.4, Spark converts ORC Hive tables by default, too. It means Spark uses its own ORC support by default instead of Hive SerDe. As an example, `CREATE TABLE t(id int) STORED AS ORC` would be handled with Hive SerDe in Spark 2.3, and in Spark 2.4, it would be converted into Spark's ORC data source table and ORC vectorization would be applied. Setting `spark.sql.hive.convertMetastoreOrc` to `false` restores the previous behavior.
    +  - In version 2.3 and earlier, CSV rows are considered malformed if at least one column value in the row is malformed. The CSV parser drops such rows in the DROPMALFORMED mode or outputs an error in the FAILFAST mode. Since Spark 2.4, a CSV row is considered malformed only when it contains malformed column values requested from the CSV datasource; other values can be ignored. As an example, a CSV file contains the "id,name" header and one row "1234". In Spark 2.4, selecting the id column yields a row with the single column value 1234, but in Spark 2.3 and earlier it is empty in the DROPMALFORMED mode. To restore the previous behavior, set `spark.sql.csv.parser.columnPruning.enabled` to `false`.
    +  - Since Spark 2.4, file listing for computing statistics is done in parallel by default. This can be disabled by setting `spark.sql.statistics.parallelFileListingInStatsComputation.enabled` to `False`.
    +  - Since Spark 2.4, Metadata files (e.g. Parquet summary files) and temporary files are not counted as data files when calculating table size during Statistics computation.
    +  - Since Spark 2.4, empty strings are saved as quoted empty strings `""`. In version 2.3 and earlier, empty strings are equal to `null` values and are not written out as any characters in saved CSV files. For example, the row of `"a", null, "", 1` was written as `a,,,1`. Since Spark 2.4, the same row is saved as `a,,"",1`. To restore the previous behavior, set the CSV option `emptyValue` to an empty (not quoted) string.
    +  - Since Spark 2.4, the LOAD DATA command supports the wildcards `?` and `*`, which match any one character, and zero or more characters, respectively. Example: `LOAD DATA INPATH '/tmp/folder*/'` or `LOAD DATA INPATH '/tmp/part-?'`. Special characters like `space` also now work in paths. Example: `LOAD DATA INPATH '/tmp/folder name/'`.
    +  - In Spark version 2.3 and earlier, HAVING without GROUP BY is treated as WHERE. This means that `SELECT 1 FROM range(10) HAVING true` is executed as `SELECT 1 FROM range(10) WHERE true` and returns 10 rows. This violates the SQL standard and has been fixed in Spark 2.4. Since Spark 2.4, HAVING without GROUP BY is treated as a global aggregate, which means `SELECT 1 FROM range(10) HAVING true` will return only one row. To restore the previous behavior, set `spark.sql.legacy.parser.havingWithoutGroupByAsWhere` to `true`.
    +
    +## Upgrading From Spark SQL 2.3.0 to 2.3.1 and above
    +
    +  - As of version 2.3.1 Arrow functionality, including `pandas_udf` and `toPandas()`/`createDataFrame()` with `spark.sql.execution.arrow.enabled` set to `True`, has been marked as experimental. These are still evolving and not currently recommended for use in production.
    +
    +## Upgrading From Spark SQL 2.2 to 2.3
    +
    +  - Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the referenced columns only include the internal corrupt record column (named `_corrupt_record` by default). For example, `spark.read.schema(schema).json(file).filter($"_corrupt_record".isNotNull).count()` and `spark.read.schema(schema).json(file).select("_corrupt_record").show()`. Instead, you can cache or save the parsed results and then send the same query. For example, `val df = spark.read.schema(schema).json(file).cache()` and then `df.filter($"_corrupt_record".isNotNull).count()`.
    +  - The `percentile_approx` function previously accepted numeric type input and output double type results. Now it supports date type, timestamp type and numeric types as input types. The result type is also changed to be the same as the input type, which is more reasonable for percentiles.
    +  - Since Spark 2.3, the Join/Filter's deterministic predicates that are after the first non-deterministic predicates are also pushed down/through the child operators, if possible. In prior Spark versions, these filters are not eligible for predicate pushdown.
    +  - Partition column inference previously found incorrect common type for different inferred types, for example, previously it ended up with double type as the common type for double type and date type. Now it finds the correct common type for such conflicts. The conflict resolution follows the table below:
    +    <table class="table">
    +      <tr>
    +        <th>
    +          <b>InputA \ InputB</b>
    +        </th>
    +        <th>
    +          <b>NullType</b>
    +        </th>
    +        <th>
    +          <b>IntegerType</b>
    +        </th>
    +        <th>
    +          <b>LongType</b>
    +        </th>
    +        <th>
    +          <b>DecimalType(38,0)*</b>
    +        </th>
    +        <th>
    +          <b>DoubleType</b>
    +        </th>
    +        <th>
    +          <b>DateType</b>
    +        </th>
    +        <th>
    +          <b>TimestampType</b>
    +        </th>
    +        <th>
    +          <b>StringType</b>
    +        </th>
    +      </tr>
    +      <tr>
    +        <td>
    +          <b>NullType</b>
    +        </td>
    +        <td>NullType</td>
    +        <td>IntegerType</td>
    +        <td>LongType</td>
    +        <td>DecimalType(38,0)</td>
    +        <td>DoubleType</td>
    +        <td>DateType</td>
    +        <td>TimestampType</td>
    +        <td>StringType</td>
    +      </tr>
    +      <tr>
    +        <td>
    +          <b>IntegerType</b>
    +        </td>
    +        <td>IntegerType</td>
    +        <td>IntegerType</td>
    +        <td>LongType</td>
    +        <td>DecimalType(38,0)</td>
    +        <td>DoubleType</td>
    +        <td>StringType</td>
    +        <td>StringType</td>
    +        <td>StringType</td>
    +      </tr>
    +      <tr>
    +        <td>
    +          <b>LongType</b>
    +        </td>
    +        <td>LongType</td>
    +        <td>LongType</td>
    +        <td>LongType</td>
    +        <td>DecimalType(38,0)</td>
    +        <td>StringType</td>
    +        <td>StringType</td>
    +        <td>StringType</td>
    +        <td>StringType</td>
    +      </tr>
    +      <tr>
    +        <td>
    +          <b>DecimalType(38,0)*</b>
    +        </td>
    +        <td>DecimalType(38,0)</td>
    +        <td>DecimalType(38,0)</td>
    +        <td>DecimalType(38,0)</td>
    +        <td>DecimalType(38,0)</td>
    +        <td>StringType</td>
    +        <td>StringType</td>
    +        <td>StringType</td>
    +        <td>StringType</td>
    +      </tr>
    +      <tr>
    +        <td>
    +          <b>DoubleType</b>
    +        </td>
    +        <td>DoubleType</td>
    +        <td>DoubleType</td>
    +        <td>StringType</td>
    +        <td>StringType</td>
    +        <td>DoubleType</td>
    +        <td>StringType</td>
    +        <td>StringType</td>
    +        <td>StringType</td>
    +      </tr>
    +      <tr>
    +        <td>
    +          <b>DateType</b>
    +        </td>
    +        <td>DateType</td>
    +        <td>StringType</td>
    +        <td>StringType</td>
    +        <td>StringType</td>
    +        <td>StringType</td>
    +        <td>DateType</td>
    +        <td>TimestampType</td>
    +        <td>StringType</td>
    +      </tr>
    +      <tr>
    +        <td>
    +          <b>TimestampType</b>
    +        </td>
    +        <td>TimestampType</td>
    +        <td>StringType</td>
    +        <td>StringType</td>
    +        <td>StringType</td>
    +        <td>StringType</td>
    +        <td>TimestampType</td>
    +        <td>TimestampType</td>
    +        <td>StringType</td>
    +      </tr>
    +      <tr>
    +        <td>
    +          <b>StringType</b>
    +        </td>
    +        <td>StringType</td>
    +        <td>StringType</td>
    +        <td>StringType</td>
    +        <td>StringType</td>
    +        <td>StringType</td>
    +        <td>StringType</td>
    +        <td>StringType</td>
    +        <td>StringType</td>
    +      </tr>
    +    </table>
    +
    +    Note that, for <b>DecimalType(38,0)*</b>, the table above intentionally does not cover all other combinations of scales and precisions because currently we only infer decimal type like `BigInteger`/`BigInt`. For example, 1.1 is inferred as double type.
    +  - In PySpark, Pandas 0.19.2 or higher is now required if you want to use Pandas-related functionality, such as `toPandas`, `createDataFrame` from Pandas DataFrame, etc.
    +  - In PySpark, the behavior of timestamp values for Pandas related functionalities was changed to respect session timezone. If you want to use the old behavior, you need to set a configuration `spark.sql.execution.pandas.respectSessionTimeZone` to `False`. See [SPARK-22395](https://issues.apache.org/jira/browse/SPARK-22395) for details.
    +  - In PySpark, `na.fill()` or `fillna` also accepts boolean and replaces nulls with booleans. In prior Spark versions, PySpark just ignores it and returns the original Dataset/DataFrame.
    +  - Since Spark 2.3, when either broadcast hash join or broadcast nested loop join is applicable, we prefer to broadcasting the table that is explicitly specified in a broadcast hint. For details, see the section [Broadcast Hint](sql-performance-turing.html#broadcast-hint-for-sql-queries) and [SPARK-22489](https://issues.apache.org/jira/browse/SPARK-22489).
    +  - Since Spark 2.3, when all inputs are binary, `functions.concat()` returns its output as binary. Otherwise, it returns a string. Until Spark 2.3, it always returned a string regardless of the input types. To keep the old behavior, set `spark.sql.function.concatBinaryAsString` to `true`.
    +  - Since Spark 2.3, when all inputs are binary, SQL `elt()` returns its output as binary. Otherwise, it returns a string. Until Spark 2.3, it always returned a string regardless of the input types. To keep the old behavior, set `spark.sql.function.eltOutputAsString` to `true`.
    +
    + - Since Spark 2.3, by default arithmetic operations between decimals return a rounded value if an exact representation is not possible (instead of returning NULL). This is compliant with SQL ANSI 2011 specification and Hive's new behavior introduced in Hive 2.2 (HIVE-15331). This involves the following changes
    +    - The rules to determine the result type of an arithmetic operation have been updated. In particular, if the precision / scale needed are out of the range of available values, the scale is reduced up to 6, in order to prevent the truncation of the integer part of the decimals. All the arithmetic operations are affected by the change, ie. addition (`+`), subtraction (`-`), multiplication (`*`), division (`/`), remainder (`%`) and positive module (`pmod`).
    +    - Literal values used in SQL operations are converted to DECIMAL with the exact precision and scale needed by them.
    +    - The configuration `spark.sql.decimalOperations.allowPrecisionLoss` has been introduced. It defaults to `true`, which means the new behavior described here; if set to `false`, Spark uses previous rules, ie. it doesn't adjust the needed scale to represent the values and it returns NULL if an exact representation of the value is not possible.
    +  - In PySpark, `df.replace` does not allow omitting `value` when `to_replace` is not a dictionary. Previously, `value` could be omitted in the other cases and had `None` by default, which is counterintuitive and error-prone.
    +  - The semantics of un-aliased subqueries were not well defined and led to confusing behaviors. Since Spark 2.3, we invalidate such confusing cases, for example: for `SELECT v.i from (SELECT i FROM v)`, Spark will throw an analysis exception because users should not be able to use the qualifier inside a subquery. See [SPARK-20690](https://issues.apache.org/jira/browse/SPARK-20690) and [SPARK-21335](https://issues.apache.org/jira/browse/SPARK-21335) for more details.
    +
    +  - When creating a `SparkSession` with `SparkSession.builder.getOrCreate()`, if there is an existing `SparkContext`, the builder was trying to update the `SparkConf` of the existing `SparkContext` with configurations specified to the builder, but the `SparkContext` is shared by all `SparkSession`s, so we should not update them. Since 2.3, the builder come to not update the configurations. If you want to update them, you need to update them prior to creating a `SparkSession`.
    --- End diff --
    
    `the builder come` -> `the builder comes`?


---



[GitHub] spark issue #22746: [SPARK-24499][SQL][DOC] Split the page of sql-programmin...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22746
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/4033/
    Test PASSed.


---
