Posted to reviews@spark.apache.org by zero323 <gi...@git.apache.org> on 2017/05/10 11:02:51 UTC

[GitHub] spark pull request #17938: [DOCS][SQL] Document bucketing and partitioning i...

GitHub user zero323 opened a pull request:

    https://github.com/apache/spark/pull/17938

    [DOCS][SQL] Document bucketing and partitioning in SQL guide

    ## What changes were proposed in this pull request?
    
    - Add Scala, Python and Java examples for `partitionBy`, `sortBy` and `bucketBy`.
    - Add _Bucketing, Sorting and Partitioning_ section to SQL Programming Guide
    
    ## How was this patch tested?
    
    Manual tests, docs build.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/zero323/spark DOCS-BUCKETING-AND-PARTITIONING

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/17938.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #17938
    
----
commit 560fd7978c2a18c8c216604eeea4563bcc4f7c5c
Author: zero323 <ze...@users.noreply.github.com>
Date:   2017-05-10T09:56:28Z

    Add Scala examples

commit c0b037b302b10c20b2dadcc32048f3ee370d1864
Author: zero323 <ze...@users.noreply.github.com>
Date:   2017-05-10T09:56:50Z

    Add Python examples

commit b2f45efcb883508e906232582e4a9e89b7f706d0
Author: zero323 <ze...@users.noreply.github.com>
Date:   2017-05-10T10:22:27Z

    Add Java examples

commit 0af67cea0f1a1644139115274f14dab76732b5b5
Author: zero323 <ze...@users.noreply.github.com>
Date:   2017-05-10T10:32:47Z

    Add examples to sql guide

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter partit...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17938
  
    Merged build finished. Test PASSed.



[GitHub] spark issue #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter partit...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/17938
  
    LGTM except for a few minor comments.
    
    cc @tejasapatil @cloud-fan 



[GitHub] spark issue #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter partit...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17938
  
    Merged build finished. Test PASSed.



[GitHub] spark issue #17938: [DOCS][SQL] Document bucketing and partitioning in SQL g...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17938
  
    **[Test build #76748 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76748/testReport)** for PR 17938 at commit [`20c7ca6`](https://github.com/apache/spark/commit/20c7ca699876d1f7b1b5096bdf037492c44d3cd7).



[GitHub] spark issue #17938: [DOCS][SQL] Document bucketing and partitioning in SQL g...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17938
  
    **[Test build #76748 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76748/testReport)** for PR 17938 at commit [`20c7ca6`](https://github.com/apache/spark/commit/20c7ca699876d1f7b1b5096bdf037492c44d3cd7).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark issue #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter partit...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17938
  
    Merged build finished. Test FAILed.



[GitHub] spark pull request #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter...

Posted by zero323 <gi...@git.apache.org>.

Github user zero323 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17938#discussion_r116030963
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -581,6 +581,46 @@ Starting from Spark 2.1, persistent datasource tables have per-partition metadat
     
     Note that partition information is not gathered by default when creating external datasource tables (those with a `path` option). To sync the partition information in the metastore, you can invoke `MSCK REPAIR TABLE`.
     
    +### Bucketing, Sorting and Partitioning
    --- End diff --
    
    @cloud-fan  I think we can redirect to partition discovery here. But explaining the difference and possible applications (low vs. high cardinality) could be a good idea.
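
The low vs. high cardinality distinction mentioned above can be illustrated with a minimal sketch in plain Python. This is not Spark code and not Spark's implementation: `partition_layout`, `bucket_layout`, the sample rows, and the use of `crc32` as the hash are all assumptions made for illustration. It only shows why one output group per distinct value scales poorly for a high-cardinality column, while hashing into a fixed number of buckets keeps the group count bounded.

```python
from collections import defaultdict
from zlib import crc32


def partition_layout(rows, column):
    """Group rows so that each distinct value gets its own group
    (analogous to one output directory per partition value)."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[column]].append(row)
    return dict(groups)


def bucket_layout(rows, column, num_buckets):
    """Group rows by hash(value) modulo a fixed bucket count.
    crc32 is a stand-in hash chosen for this sketch only."""
    groups = defaultdict(list)
    for row in rows:
        bucket = crc32(str(row[column]).encode("utf-8")) % num_buckets
        groups[bucket].append(row)
    return dict(groups)


# Hypothetical data: 1000 distinct user ids, 3 distinct countries.
rows = [{"user_id": i, "country": ("PL", "US", "IN")[i % 3]} for i in range(1000)]

by_country = partition_layout(rows, "country")           # low cardinality
by_user_partitioned = partition_layout(rows, "user_id")  # high cardinality
by_user_bucketed = bucket_layout(rows, "user_id", 8)     # bounded by bucket count

print(len(by_country))           # number of distinct countries
print(len(by_user_partitioned))  # one group per user id
print(len(by_user_bucketed))     # never more than 8
```

Under these assumptions, partitioning on `country` yields a handful of groups, partitioning on `user_id` yields one group per row, and bucketing on `user_id` never exceeds the chosen bucket count.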



[GitHub] spark pull request #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17938#discussion_r116386782
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -581,6 +581,113 @@ Starting from Spark 2.1, persistent datasource tables have per-partition metadat
     
     Note that partition information is not gathered by default when creating external datasource tables (those with a `path` option). To sync the partition information in the metastore, you can invoke `MSCK REPAIR TABLE`.
     
    +### Bucketing, Sorting and Partitioning
    +
    +For file-based data source it is also possible to bucket and sort or partition the output. 
    +Bucketing and sorting is applicable only to persistent tables:
    +
    +<div class="codetabs">
    +
    +<div data-lang="scala"  markdown="1">
    +{% include_example write_sorting_and_bucketing scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %}
    +</div>
    +
    +<div data-lang="java"  markdown="1">
    +{% include_example write_sorting_and_bucketing java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %}
    +</div>
    +
    +<div data-lang="python"  markdown="1">
    +{% include_example write_sorting_and_bucketing python/sql/datasource.py %}
    +</div>
    +
    +<div data-lang="sql"  markdown="1">
    +
    +{% highlight sql %}
    +
    +CREATE TABLE users_bucketed_by_name(
    +  name STRING,
    +  favorite_color STRING,
    +  favorite_NUMBERS array<integer>
    +) USING parquet 
    +CLUSTERED BY(name) INTO 42 BUCKETS;
    --- End diff --
    
    @zero323 Could you also resolve this? Thanks!



[GitHub] spark pull request #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter...

Posted by tejasapatil <gi...@git.apache.org>.

Github user tejasapatil commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17938#discussion_r115890492
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -581,6 +581,46 @@ Starting from Spark 2.1, persistent datasource tables have per-partition metadat
     
     Note that partition information is not gathered by default when creating external datasource tables (those with a `path` option). To sync the partition information in the metastore, you can invoke `MSCK REPAIR TABLE`.
     
    +### Bucketing, Sorting and Partitioning
    --- End diff --
    
    I feel that the examples are missing a write to a partitioned + bucketed table, e.g.:
    
    ```
    my_dataframe.write.format("orc").partitionBy("i").bucketBy(8, "j", "k").sortBy("j", "k").saveAsTable("my_table")
    ```
    
    There could be multiple possible orderings of `partitionBy`, `bucketBy` and `sortBy` calls. Not all of them are supported, and not all of them would produce correct output. I have not studied this exhaustively, but I think it should be documented to guide people using these APIs.
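
One way to approach the exhaustive check suggested above is to enumerate every ordering of the three calls and try each one by hand. The following is a small plain-Python sketch that only generates the call chains as strings; it does not run Spark and makes no claim about which orderings work. The numbered table names (`my_table_0` and so on) are hypothetical, added so that each ordering writes to its own table.

```python
from itertools import permutations

# The three builder calls from the example above, keyed by method name.
calls = {
    "partitionBy": 'partitionBy("i")',
    "bucketBy": 'bucketBy(8, "j", "k")',
    "sortBy": 'sortBy("j", "k")',
}

# Build one call chain per ordering of the three methods.
orderings = [
    'my_dataframe.write.format("orc").'
    + ".".join(calls[name] for name in order)
    + f'.saveAsTable("my_table_{idx}")'
    for idx, order in enumerate(permutations(calls))
]

for chain in orderings:
    print(chain)
```

Each printed line can then be pasted into a Spark shell, and the resulting table layout compared across orderings.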



[GitHub] spark issue #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter partit...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17938
  
    Merged build finished. Test PASSed.



[GitHub] spark issue #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter partit...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17938
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/76831/
    Test PASSed.



[GitHub] spark issue #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter partit...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17938
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/76825/
    Test FAILed.



[GitHub] spark issue #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter partit...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17938
  
    **[Test build #77436 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77436/testReport)** for PR 17938 at commit [`bea0676`](https://github.com/apache/spark/commit/bea0676088dadbc5af544f581aa8a2ed49355acc).



[GitHub] spark issue #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter partit...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17938
  
    **[Test build #76900 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76900/testReport)** for PR 17938 at commit [`92fb3b3`](https://github.com/apache/spark/commit/92fb3b3e00a666ff3bd1eca4e5dee0cefcca2d55).



[GitHub] spark issue #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter partit...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17938
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/76905/
    Test PASSed.



[GitHub] spark issue #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter partit...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17938
  
    **[Test build #76828 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76828/testReport)** for PR 17938 at commit [`cc1bfcf`](https://github.com/apache/spark/commit/cc1bfcf281b32860113215c3f34cbacf3bb47cbb).



[GitHub] spark issue #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter partit...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17938
  
    **[Test build #76831 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76831/testReport)** for PR 17938 at commit [`a7aff81`](https://github.com/apache/spark/commit/a7aff811aa88b1f93364aa51ab95b6b64fa63d8d).



[GitHub] spark issue #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter partit...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17938
  
    **[Test build #76831 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76831/testReport)** for PR 17938 at commit [`a7aff81`](https://github.com/apache/spark/commit/a7aff811aa88b1f93364aa51ab95b6b64fa63d8d).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17938#discussion_r115925939
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -1766,12 +1806,6 @@ Spark SQL supports the vast majority of Hive features, such as:
     Below is a list of Hive features that we don't support yet. Most of these features are rarely used
     in Hive deployments.
     
    -**Major Hive Features**
    -
    -* Tables with buckets: bucket is the hash partitioning within a Hive table partition. Spark SQL
    -  doesn't support buckets yet.
    --- End diff --
    
    +1



[GitHub] spark issue #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter partit...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17938
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/76900/
    Test PASSed.



[GitHub] spark issue #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter partit...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/17938
  
    In the current 2.2 docs, we already updated all the syntax to `CREATE TABLE ... USING...`. This is a new change delivered in 2.2.
    
    Thus, it is OK to document it the way you just committed. Let me review the changes carefully now. Thanks for your work!



[GitHub] spark issue #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter partit...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17938
  
    **[Test build #76899 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76899/testReport)** for PR 17938 at commit [`b5babf6`](https://github.com/apache/spark/commit/b5babf65571661ca45880cd80a950959f66523a1).



[GitHub] spark pull request #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter...

Posted by tejasapatil <gi...@git.apache.org>.

Github user tejasapatil commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17938#discussion_r116059777
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -581,6 +581,46 @@ Starting from Spark 2.1, persistent datasource tables have per-partition metadat
     
     Note that partition information is not gathered by default when creating external datasource tables (those with a `path` option). To sync the partition information in the metastore, you can invoke `MSCK REPAIR TABLE`.
     
    +### Bucketing, Sorting and Partitioning
    --- End diff --
    
    >> Shouldn't the output be the same no matter the order?
    
    Theoretically, yes. In practice, I don't know what happens. Since you are documenting this, it would be worthwhile to check and record whether it works as expected (or whether there is any weirdness).



[GitHub] spark issue #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter partit...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/17938
  
    When you omit `USING`, it's Hive-style `CREATE TABLE` syntax, which is very different from Spark's. We should encourage users to use the Spark-style `CREATE TABLE` syntax and document only that (with the `USING` clause).



[GitHub] spark issue #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter partit...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/17938
  
    We are going to support bucketing in Hive-style `CREATE TABLE` syntax soon.



[GitHub] spark pull request #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17938#discussion_r115880901
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -1766,12 +1806,6 @@ Spark SQL supports the vast majority of Hive features, such as:
     Below is a list of Hive features that we don't support yet. Most of these features are rarely used
     in Hive deployments.
     
    -**Major Hive Features**
    -
    -* Tables with buckets: bucket is the hash partitioning within a Hive table partition. Spark SQL
    -  doesn't support buckets yet.
    --- End diff --
    
    We do support buckets, but the support is slightly different from Hive's. See the ongoing PR: https://github.com/apache/spark/pull/17644
    
    Could you document the difference too? Thanks!



[GitHub] spark pull request #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17938#discussion_r116371744
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -581,6 +581,113 @@ Starting from Spark 2.1, persistent datasource tables have per-partition metadat
     
     Note that partition information is not gathered by default when creating external datasource tables (those with a `path` option). To sync the partition information in the metastore, you can invoke `MSCK REPAIR TABLE`.
     
    +### Bucketing, Sorting and Partitioning
    +
    +For file-based data source it is also possible to bucket and sort or partition the output. 
    +Bucketing and sorting is applicable only to persistent tables:
    +
    +<div class="codetabs">
    +
    +<div data-lang="scala"  markdown="1">
    +{% include_example write_sorting_and_bucketing scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %}
    +</div>
    +
    +<div data-lang="java"  markdown="1">
    +{% include_example write_sorting_and_bucketing java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %}
    +</div>
    +
    +<div data-lang="python"  markdown="1">
    +{% include_example write_sorting_and_bucketing python/sql/datasource.py %}
    +</div>
    +
    +<div data-lang="sql"  markdown="1">
    +
    +{% highlight sql %}
    +
    +CREATE TABLE users_bucketed_by_name(
    +  name STRING,
    +  favorite_color STRING,
    +  favorite_NUMBERS array<integer>
    +) USING parquet 
    +CLUSTERED BY(name) INTO 42 BUCKETS;
    +
    +{% endhighlight %}
    +
    +</div>
    +
    +</div>
    +
    +while partitioning can be used with both `save` and `saveAsTable`:
    +
    +
    +<div class="codetabs">
    +
    +<div data-lang="scala"  markdown="1">
    +{% include_example write_partitioning scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %}
    +</div>
    +
    +<div data-lang="java"  markdown="1">
    +{% include_example write_partitioning java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %}
    +</div>
    +
    +<div data-lang="python"  markdown="1">
    +{% include_example write_partitioning python/sql/datasource.py %}
    +</div>
    +
    +<div data-lang="sql"  markdown="1">
    +
    +{% highlight sql %}
    +
    +CREATE TABLE users_by_favorite_color(
    +  name STRING, 
    +  favorite_color STRING,
    +  favorite_NUMBERS array<integer>
    +) USING csv PARTITIONED BY(favorite_color);
    +
    +{% endhighlight %}
    +
    +</div>
    +
    +</div>
    +
    +It is possible to use both partitions and buckets for a single table:
    +
    +<div class="codetabs">
    +
    +<div data-lang="scala"  markdown="1">
    +{% include_example write_partition_and_bucket scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %}
    +</div>
    +
    +<div data-lang="java"  markdown="1">
    +{% include_example write_partition_and_bucket java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %}
    +</div>
    +
    +<div data-lang="python"  markdown="1">
    +{% include_example write_partition_and_bucket python/sql/datasource.py %}
    +</div>
    +
    +<div data-lang="sql"  markdown="1">
    +
    +{% highlight sql %}
    +
    +CREATE TABLE users_bucketed_and_partitioned(
    +  name STRING,
    +  favorite_color STRING,
    +  favorite_NUMBERS array<integer>
    +) USING parquet 
    +PARTITIONED BY (favorite_color)
    +CLUSTERED BY(name) INTO 42 BUCKETS;
    +
    +{% endhighlight %}
    +
    +</div>
    +
    +</div>
    +
    +`partitionBy` creates a directory structure as described in the [Partition Discovery](#partition-discovery) section.
    +Because of that it has limited applicability to columns with high cardinality. In contrast `bucketBy` distributes
    +data across fixed number of buckets and can be used if a number of unique values is unbounded.
    --- End diff --
    
    `used if` -> `used when `
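
A rough illustration of the cardinality point in the quoted paragraph, written as a minimal pure-Python sketch rather than Spark code. It assumes one `column=value` directory per distinct value for partitioning and a hash modulo a fixed bucket count for bucketing. The helper names, the bucket count of 4, and the use of `zlib.crc32` are all assumptions made for the sketch; Spark uses its own hash function.

```python
import zlib

NUM_BUCKETS = 4  # illustrative only; the SQL examples in this thread use 42


def partition_dirs(rows, column):
    """Directory names a partitioned write would need: one per distinct value."""
    return sorted({f"{column}={row[column]}" for row in rows})


def bucket_ids(rows, column, num_buckets=NUM_BUCKETS):
    """Bucket ids in use: hash(value) mod num_buckets, so never more than num_buckets."""
    # crc32 is a deterministic stand-in hash, not the function Spark uses.
    return sorted(
        {zlib.crc32(str(row[column]).encode("utf-8")) % num_buckets for row in rows}
    )


# 1000 users, each with a unique name (a high-cardinality column).
rows = [{"name": f"user{i}", "favorite_color": "red"} for i in range(1000)]

print(len(partition_dirs(rows, "name")))  # 1000 directories, one per name
print(len(bucket_ids(rows, "name")))      # bounded by NUM_BUCKETS
```

The directory count grows with the number of distinct names, while the bucket count stays capped no matter how many names appear, which is why the paragraph recommends bucketing for unbounded columns.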



[GitHub] spark issue #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter partit...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/17938
  
    @zero323 Could you also document the SQL interface for creating bucketed and partitioned tables in this PR?



[GitHub] spark issue #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter partit...

Posted by zero323 <gi...@git.apache.org>.

Github user zero323 commented on the issue:

    https://github.com/apache/spark/pull/17938
  
    @HyukjinKwon Sounds good. [SPARK-20694](https://issues.apache.org/jira/browse/SPARK-20694). 
    
    Should we document the difference between buckets (metastore based) and partitions (file system based)? The latter could be done by referencing [Partition Discovery](https://spark.apache.org/docs/latest/sql-programming-guide.html#partition-discovery).
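
To make the "file system based" half of that distinction concrete, here is a minimal pure-Python sketch, not Spark code, of how partition values can be read back from directory names alone. It assumes the `column=value` directory layout described in the Partition Discovery section. The function name and the example path are hypothetical. Bucketing has no such encoding in the directory tree, which is why the comment calls it metastore based.

```python
from pathlib import PurePosixPath


def partition_values(path):
    """Return {column: value} for every key=value directory in the file's path."""
    values = {}
    for segment in PurePosixPath(path).parent.parts:
        key, sep, value = segment.partition("=")
        if sep and key:  # keep only segments of the form key=value
            values[key] = value
    return values


example = "users/favorite_color=red/part-00000.parquet"  # hypothetical path
print(partition_values(example))  # {'favorite_color': 'red'}
```

Only the directory segments are inspected; the file name itself is ignored, so a data file whose name happens to contain `=` does not produce a spurious column.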



[GitHub] spark pull request #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter...

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/17938



[GitHub] spark issue #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter partit...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17938
  
    **[Test build #76830 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76830/testReport)** for PR 17938 at commit [`606f1e3`](https://github.com/apache/spark/commit/606f1e3a5f672d8f7a7dc98fe041e347e65a2d03).
     * This patch **fails Python style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark issue #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter partit...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17938
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/76867/
    Test PASSed.



[GitHub] spark issue #17938: [DOCS][SQL] Document bucketing and partitioning in SQL g...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17938
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/76748/
    Test PASSed.



[GitHub] spark issue #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter partit...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17938
  
    **[Test build #76813 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76813/testReport)** for PR 17938 at commit [`a14296a`](https://github.com/apache/spark/commit/a14296a5ad443e04471fc26e361d7ee77f7dfff2).



[GitHub] spark issue #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter partit...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17938
  
    **[Test build #76867 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76867/testReport)** for PR 17938 at commit [`c4d7856`](https://github.com/apache/spark/commit/c4d7856c82aab845cf9cef4460302461db7e1384).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter...

Posted by tejasapatil <gi...@git.apache.org>.

Github user tejasapatil commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17938#discussion_r115888674
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -581,6 +581,46 @@ Starting from Spark 2.1, persistent datasource tables have per-partition metadat
     
     Note that partition information is not gathered by default when creating external datasource tables (those with a `path` option). To sync the partition information in the metastore, you can invoke `MSCK REPAIR TABLE`.
     
    +### Bucketing, Sorting and Partitioning
    +
    +For file-based data source it is also possible to bucket and and sort or partition the output. 
    --- End diff --
    
    nit: `bucket and and sort` : double and



[GitHub] spark pull request #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17938#discussion_r116371680
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -581,6 +581,113 @@ Starting from Spark 2.1, persistent datasource tables have per-partition metadat
     
     Note that partition information is not gathered by default when creating external datasource tables (those with a `path` option). To sync the partition information in the metastore, you can invoke `MSCK REPAIR TABLE`.
     
    +### Bucketing, Sorting and Partitioning
    +
    +For file-based data source it is also possible to bucket and sort or partition the output. 
    +Bucketing and sorting is applicable only to persistent tables:
    +
    +<div class="codetabs">
    +
    +<div data-lang="scala"  markdown="1">
    +{% include_example write_sorting_and_bucketing scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %}
    +</div>
    +
    +<div data-lang="java"  markdown="1">
    +{% include_example write_sorting_and_bucketing java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %}
    +</div>
    +
    +<div data-lang="python"  markdown="1">
    +{% include_example write_sorting_and_bucketing python/sql/datasource.py %}
    +</div>
    +
    +<div data-lang="sql"  markdown="1">
    +
    +{% highlight sql %}
    +
    +CREATE TABLE users_bucketed_by_name(
    +  name STRING,
    +  favorite_color STRING,
    +  favorite_NUMBERS array<integer>
    +) USING parquet 
    +CLUSTERED BY(name) INTO 42 BUCKETS;
    +
    +{% endhighlight %}
    +
    +</div>
    +
    +</div>
    +
    +while partitioning can be used with both `save` and `saveAsTable`:
    --- End diff --
    
    Nit: 
    ```
    both `save` and `saveAsTable`
    ```
    ->
    ```
    both `save` and `saveAsTable` when using the Dataset APIs. 
    ```



[GitHub] spark pull request #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17938#discussion_r118643142
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -581,6 +581,114 @@ Starting from Spark 2.1, persistent datasource tables have per-partition metadat
     
     Note that partition information is not gathered by default when creating external datasource tables (those with a `path` option). To sync the partition information in the metastore, you can invoke `MSCK REPAIR TABLE`.
     
    +### Bucketing, Sorting and Partitioning
    +
    +For file-based data source, it is also possible to bucket and sort or partition the output. 
    +Bucketing and sorting are applicable only to persistent tables:
    +
    +<div class="codetabs">
    +
    +<div data-lang="scala"  markdown="1">
    +{% include_example write_sorting_and_bucketing scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %}
    +</div>
    +
    +<div data-lang="java"  markdown="1">
    +{% include_example write_sorting_and_bucketing java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %}
    +</div>
    +
    +<div data-lang="python"  markdown="1">
    +{% include_example write_sorting_and_bucketing python/sql/datasource.py %}
    +</div>
    +
    +<div data-lang="sql"  markdown="1">
    +
    +{% highlight sql %}
    +
    +CREATE TABLE users_bucketed_by_name(
    +  name STRING,
    +  favorite_color STRING,
    +  favorite_numbers array<integer>
    +) USING parquet 
    +CLUSTERED BY(name) INTO 42 BUCKETS;
    +
    +{% endhighlight %}
    +
    +</div>
    +
    +</div>
    +
    +while partitioning can be used with both `save` and `saveAsTable` when using the Dataset APIs.
    +
    +
    +<div class="codetabs">
    +
    +<div data-lang="scala"  markdown="1">
    +{% include_example write_partitioning scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %}
    +</div>
    +
    +<div data-lang="java"  markdown="1">
    +{% include_example write_partitioning java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %}
    +</div>
    +
    +<div data-lang="python"  markdown="1">
    +{% include_example write_partitioning python/sql/datasource.py %}
    +</div>
    +
    +<div data-lang="sql"  markdown="1">
    +
    +{% highlight sql %}
    +
    +CREATE TABLE users_by_favorite_color(
    +  name STRING, 
    +  favorite_color STRING,
    +  favorite_numbers array<integer>
    +) USING csv PARTITIONED BY(favorite_color);
    +
    +{% endhighlight %}
    +
    +</div>
    +
    +</div>
    +
    +It is possible to use both partitioning and bucketing for a single table:
    +
    +<div class="codetabs">
    +
    +<div data-lang="scala"  markdown="1">
    +{% include_example write_partition_and_bucket scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %}
    +</div>
    +
    +<div data-lang="java"  markdown="1">
    +{% include_example write_partition_and_bucket java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %}
    +</div>
    +
    +<div data-lang="python"  markdown="1">
    +{% include_example write_partition_and_bucket python/sql/datasource.py %}
    +</div>
    +
    +<div data-lang="sql"  markdown="1">
    +
    +{% highlight sql %}
    +
    +CREATE TABLE users_bucketed_and_partitioned(
    +  name STRING,
    +  favorite_color STRING,
    +  favorite_numbers array<integer>
    +) USING parquet 
    +PARTITIONED BY (favorite_color)
    +CLUSTERED BY(name) SORTED BY (favorite_numbers) INTO 42 BUCKETS;
    +
    +{% endhighlight %}
    +
    +</div>
    +
    +</div>
    +
    +`partitionBy` creates a directory structure as described in the [Partition Discovery](#partition-discovery) section.
    +Thus, it has limited applicability to columns with high cardinality. In contrast 
    + `bucketBy` distributes
    +data across fixed number of buckets and can be used when a number of unique values is unbounded.
    --- End diff --
    
    Nit: `fixed number of` -> `a fixed number of`



[GitHub] spark issue #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter partit...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17938
  
    Merged build finished. Test PASSed.



[GitHub] spark issue #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter partit...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17938
  
    **[Test build #76911 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76911/testReport)** for PR 17938 at commit [`3a8b6e9`](https://github.com/apache/spark/commit/3a8b6e94dd40372704aa4e1cdce015bcc1c3b893).



[GitHub] spark issue #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter partit...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17938
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/77436/
    Test PASSed.



[GitHub] spark issue #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter partit...

Posted by zero323 <gi...@git.apache.org>.

Github user zero323 commented on the issue:

    https://github.com/apache/spark/pull/17938
  
    @cloud-fan Thanks for the clarification. Just a thought: shouldn't we either support it consistently or not support it at all? The current behaviour is quite confusing, and I don't think that documentation alone will cut it.



[GitHub] spark pull request #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17938#discussion_r116371598
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -581,6 +581,113 @@ Starting from Spark 2.1, persistent datasource tables have per-partition metadat
     
     Note that partition information is not gathered by default when creating external datasource tables (those with a `path` option). To sync the partition information in the metastore, you can invoke `MSCK REPAIR TABLE`.
     
    +### Bucketing, Sorting and Partitioning
    +
    +For file-based data source it is also possible to bucket and sort or partition the output. 
    +Bucketing and sorting is applicable only to persistent tables:
    +
    +<div class="codetabs">
    +
    +<div data-lang="scala"  markdown="1">
    +{% include_example write_sorting_and_bucketing scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %}
    +</div>
    +
    +<div data-lang="java"  markdown="1">
    +{% include_example write_sorting_and_bucketing java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %}
    +</div>
    +
    +<div data-lang="python"  markdown="1">
    +{% include_example write_sorting_and_bucketing python/sql/datasource.py %}
    +</div>
    +
    +<div data-lang="sql"  markdown="1">
    +
    +{% highlight sql %}
    +
    +CREATE TABLE users_bucketed_by_name(
    +  name STRING,
    +  favorite_color STRING,
    +  favorite_NUMBERS array<integer>
    +) USING parquet 
    +CLUSTERED BY(name) INTO 42 BUCKETS;
    --- End diff --
    
    This example is missing the `SORTED BY` clause, so it is inconsistent with the examples for the other APIs.



[GitHub] spark pull request #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17938#discussion_r116371649
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -581,6 +581,113 @@ Starting from Spark 2.1, persistent datasource tables have per-partition metadat
     
     Note that partition information is not gathered by default when creating external datasource tables (those with a `path` option). To sync the partition information in the metastore, you can invoke `MSCK REPAIR TABLE`.
     
    +### Bucketing, Sorting and Partitioning
    +
    +For file-based data source it is also possible to bucket and sort or partition the output. 
    +Bucketing and sorting is applicable only to persistent tables:
    +
    +<div class="codetabs">
    +
    +<div data-lang="scala"  markdown="1">
    +{% include_example write_sorting_and_bucketing scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %}
    +</div>
    +
    +<div data-lang="java"  markdown="1">
    +{% include_example write_sorting_and_bucketing java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %}
    +</div>
    +
    +<div data-lang="python"  markdown="1">
    +{% include_example write_sorting_and_bucketing python/sql/datasource.py %}
    +</div>
    +
    +<div data-lang="sql"  markdown="1">
    +
    +{% highlight sql %}
    +
    +CREATE TABLE users_bucketed_by_name(
    +  name STRING,
    +  favorite_color STRING,
    +  favorite_NUMBERS array<integer>
    +) USING parquet 
    +CLUSTERED BY(name) INTO 42 BUCKETS;
    +
    +{% endhighlight %}
    +
    +</div>
    +
    +</div>
    +
    +while partitioning can be used with both `save` and `saveAsTable`:
    +
    +
    +<div class="codetabs">
    +
    +<div data-lang="scala"  markdown="1">
    +{% include_example write_partitioning scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %}
    +</div>
    +
    +<div data-lang="java"  markdown="1">
    +{% include_example write_partitioning java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %}
    +</div>
    +
    +<div data-lang="python"  markdown="1">
    +{% include_example write_partitioning python/sql/datasource.py %}
    +</div>
    +
    +<div data-lang="sql"  markdown="1">
    +
    +{% highlight sql %}
    +
    +CREATE TABLE users_by_favorite_color(
    +  name STRING, 
    +  favorite_color STRING,
    +  favorite_NUMBERS array<integer>
    +) USING csv PARTITIONED BY(favorite_color);
    +
    +{% endhighlight %}
    +
    +</div>
    +
    +</div>
    +
    +It is possible to use both partitions and buckets for a single table:
    --- End diff --
    
    `partitions and buckets` -> `partitioning and bucketing`



[GitHub] spark pull request #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17938#discussion_r115923548
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -581,6 +581,46 @@ Starting from Spark 2.1, persistent datasource tables have per-partition metadat
     
     Note that partition information is not gathered by default when creating external datasource tables (those with a `path` option). To sync the partition information in the metastore, you can invoke `MSCK REPAIR TABLE`.
     
    +### Bucketing, Sorting and Partitioning
    +
    +For file-based data source it is also possible to bucket and and sort or partition the output. 
    +Bucketing and sorting is applicable only to persistent tables:
    +
    +<div class="codetabs">
    +
    +<div data-lang="scala"  markdown="1">
    +{% include_example write_sorting_and_bucketing scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %}
    +</div>
    +
    +<div data-lang="java"  markdown="1">
    +{% include_example write_sorting_and_bucketing java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %}
    +</div>
    +
    +<div data-lang="python"  markdown="1">
    +{% include_example write_sorting_and_bucketing python/sql/datasource.py %}
    +</div>
    +
    +</div>
    +
    +while partitioning can be used with both `save` and `saveAsTable`:
    --- End diff --
    
    As @tejasapatil suggested, we should add one more example of a partitioned and bucketed table, so that users know they can use bucketing and partitioning at the same time.
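
As a rough picture of what combining the two means, the following pure-Python sketch (not Spark code) groups rows first by a partition directory and then by a bucket id within it. The `layout` helper, the sample users, and the `zlib.crc32` hash are assumptions made for illustration; Spark's actual hash function and file naming differ.

```python
import zlib
from collections import defaultdict


def layout(rows, partition_col, bucket_col, num_buckets):
    """Group rows by (partition directory, bucket id)."""
    groups = defaultdict(list)
    for row in rows:
        directory = f"{partition_col}={row[partition_col]}"
        # crc32 is a deterministic stand-in hash, not the function Spark uses.
        bucket = zlib.crc32(row[bucket_col].encode("utf-8")) % num_buckets
        groups[(directory, bucket)].append(row[bucket_col])
    return dict(groups)


# Hypothetical sample data using the column names from the SQL examples.
users = [
    {"name": "Alyssa", "favorite_color": "red"},
    {"name": "Ben", "favorite_color": "red"},
    {"name": "Cara", "favorite_color": "blue"},
]

result = layout(users, "favorite_color", "name", 42)
for (directory, bucket), names in sorted(result.items()):
    print(directory, bucket, names)
```

Each row lands in exactly one (directory, bucket) pair: the partition column picks the directory and the bucketing column picks the bucket inside it, so the two mechanisms compose rather than compete.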


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter...

Posted by tejasapatil <gi...@git.apache.org>.

Github user tejasapatil commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17938#discussion_r115888199
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -1766,12 +1806,6 @@ Spark SQL supports the vast majority of Hive features, such as:
     Below is a list of Hive features that we don't support yet. Most of these features are rarely used
     in Hive deployments.
     
    -**Major Hive Features**
    -
    -* Tables with buckets: bucket is the hash partitioning within a Hive table partition. Spark SQL
    -  doesn't support buckets yet.
    --- End diff --
    
    Let's keep this until SPARK-19256 gets resolved.



[GitHub] spark issue #17938: [DOCS][SQL] Document bucketing and partitioning in SQL g...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17938
  
    Merged build finished. Test PASSed.



[GitHub] spark pull request #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17938#discussion_r115923119
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -581,6 +581,46 @@ Starting from Spark 2.1, persistent datasource tables have per-partition metadat
     
     Note that partition information is not gathered by default when creating external datasource tables (those with a `path` option). To sync the partition information in the metastore, you can invoke `MSCK REPAIR TABLE`.
     
    +### Bucketing, Sorting and Partitioning
    --- End diff --
    
    Shall we emphasize partitioning? I think it's more widely used than bucketing.



[GitHub] spark issue #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter partit...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17938
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/76828/
    Test PASSed.



[GitHub] spark issue #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter partit...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/17938
  
    @cloud-fan @tejasapatil Could you please help review this PR? 



[GitHub] spark pull request #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter...

Posted by zero323 <gi...@git.apache.org>.

Github user zero323 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17938#discussion_r116072807
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -581,6 +581,46 @@ Starting from Spark 2.1, persistent datasource tables have per-partition metadat
     
     Note that partition information is not gathered by default when creating external datasource tables (those with a `path` option). To sync the partition information in the metastore, you can invoke `MSCK REPAIR TABLE`.
     
    +### Bucketing, Sorting and Partitioning
    --- End diff --
    
    Oh, I thought you were implying that there are some known issues. This actually behaves sensibly: all supported options seem to work independently of the order, and unsupported ones (`partitionBy` + `sortBy` without `bucketBy`, or overlapping `bucketBy` and `partitionBy` columns) give enough feedback to diagnose the issue.
    
    I haven't tested this with large datasets though, so there can be hidden issues.
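
For illustration only, the two unsupported combinations mentioned above can be modelled as a small check. This is a minimal pure-Python sketch of the rules as described in this comment, not Spark's implementation; `check_writer_options` and its parameter names are hypothetical, and the error messages are not the ones Spark prints.

```python
def check_writer_options(partition_cols=(), bucket_cols=(), sort_cols=()):
    """Toy model of the writer option checks described in the comment above.

    Hypothetical helper: it mirrors the two rejected combinations
    (sortBy without bucketBy, and columns shared between bucketBy and
    partitionBy) and accepts everything else.
    """
    # sortBy only makes sense within buckets, so it needs bucketBy.
    if sort_cols and not bucket_cols:
        raise ValueError("sortBy requires bucketBy")

    # A column cannot be used both as a partition column and a bucket column.
    overlap = set(partition_cols) & set(bucket_cols)
    if overlap:
        raise ValueError(
            "columns used for both partitioning and bucketing: %s" % sorted(overlap)
        )
    return True
```

The check takes plain collections rather than a chain of calls, which reflects the observation above that the order of `partitionBy`, `bucketBy` and `sortBy` does not matter for the supported combinations.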



[GitHub] spark issue #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter partit...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17938
  
    **[Test build #76828 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76828/testReport)** for PR 17938 at commit [`cc1bfcf`](https://github.com/apache/spark/commit/cc1bfcf281b32860113215c3f34cbacf3bb47cbb).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark issue #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter partit...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17938
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/76911/
    Test PASSed.



[GitHub] spark issue #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter partit...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17938
  
    Merged build finished. Test PASSed.



[GitHub] spark issue #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter partit...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17938
  
    Merged build finished. Test PASSed.



[GitHub] spark issue #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter partit...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17938
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/76813/
    Test PASSed.



[GitHub] spark issue #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter partit...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17938
  
    Merged build finished. Test PASSed.



[GitHub] spark issue #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter partit...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/17938
  
    @zero323 We already support it for data source tables. Below is just an example. 
    ```SQL
    CREATE TABLE tbl(a INT, b INT) USING parquet CLUSTERED BY (a) SORTED BY (b) INTO 5 BUCKETS
    ```



[GitHub] spark pull request #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter...

Posted by zero323 <gi...@git.apache.org>.

Github user zero323 commented on a diff in the pull request:

https://github.com/apache/spark/pull/17938#discussion_r116029940

--- Diff: docs/sql-programming-guide.md ---
@@ -581,6 +581,46 @@ Starting from Spark 2.1, persistent datasource tables have per-partition metadat

Note that partition information is not gathered by default when creating external datasource tables (those with a `path` option). To sync the partition information in the metastore, you can invoke `MSCK REPAIR TABLE`.

+### Bucketing, Sorting and Partitioning
--- End diff --

@tejasapatil

> There could be multiple possible orderings of `partitionBy`, `bucketBy` and `sortBy` calls. Not all of them are supported, not all of them would produce correct outputs.

Shouldn't the output be the same no matter the order? `sortBy` is not applicable for `partitionBy` and takes precedence over `bucketBy`, if both are present. This is Hive's behaviour if I am not mistaken, and at first glance Spark is doing the same thing. Is there any gotcha here?


[GitHub] spark issue #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter partit...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/17938
  
    (I don't think I am the one to decide this; confirmation from a committer would probably be best.)



[GitHub] spark issue #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter partit...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17938
  
    **[Test build #76825 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76825/testReport)** for PR 17938 at commit [`7bf4bbc`](https://github.com/apache/spark/commit/7bf4bbc30a6fa821d85285519c035be0a4f66b0c).



[GitHub] spark issue #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter partit...

Posted by zero323 <gi...@git.apache.org>.

Github user zero323 commented on the issue:

    https://github.com/apache/spark/pull/17938
  
    @gatorsmile Sure, but I assume you mean only `PARTITION BY`, right? I don't think that `CLUSTER BY` or `SORT BY` is supported in SQL (should it be supported after #17644 is resolved?).



[GitHub] spark issue #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter partit...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17938
  
    Merged build finished. Test PASSed.



[GitHub] spark issue #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter partit...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17938
  
    Merged build finished. Test PASSed.



[GitHub] spark issue #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter partit...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17938
  
    **[Test build #76813 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76813/testReport)** for PR 17938 at commit [`a14296a`](https://github.com/apache/spark/commit/a14296a5ad443e04471fc26e361d7ee77f7dfff2).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark issue #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter partit...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17938
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/76899/
    Test PASSed.



[GitHub] spark issue #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter partit...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17938
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/76830/
    Test FAILed.



[GitHub] spark pull request #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17938#discussion_r116371733
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -581,6 +581,113 @@ Starting from Spark 2.1, persistent datasource tables have per-partition metadat
     
     Note that partition information is not gathered by default when creating external datasource tables (those with a `path` option). To sync the partition information in the metastore, you can invoke `MSCK REPAIR TABLE`.
     
    +### Bucketing, Sorting and Partitioning
    +
    +For file-based data source it is also possible to bucket and sort or partition the output. 
    +Bucketing and sorting is applicable only to persistent tables:
    +
    +<div class="codetabs">
    +
    +<div data-lang="scala"  markdown="1">
    +{% include_example write_sorting_and_bucketing scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %}
    +</div>
    +
    +<div data-lang="java"  markdown="1">
    +{% include_example write_sorting_and_bucketing java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %}
    +</div>
    +
    +<div data-lang="python"  markdown="1">
    +{% include_example write_sorting_and_bucketing python/sql/datasource.py %}
    +</div>
    +
    +<div data-lang="sql"  markdown="1">
    +
    +{% highlight sql %}
    +
    +CREATE TABLE users_bucketed_by_name(
    +  name STRING,
    +  favorite_color STRING,
    +  favorite_NUMBERS array<integer>
    +) USING parquet 
    +CLUSTERED BY(name) INTO 42 BUCKETS;
    +
    +{% endhighlight %}
    +
    +</div>
    +
    +</div>
    +
    +while partitioning can be used with both `save` and `saveAsTable`:
    +
    +
    +<div class="codetabs">
    +
    +<div data-lang="scala"  markdown="1">
    +{% include_example write_partitioning scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %}
    +</div>
    +
    +<div data-lang="java"  markdown="1">
    +{% include_example write_partitioning java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %}
    +</div>
    +
    +<div data-lang="python"  markdown="1">
    +{% include_example write_partitioning python/sql/datasource.py %}
    +</div>
    +
    +<div data-lang="sql"  markdown="1">
    +
    +{% highlight sql %}
    +
    +CREATE TABLE users_by_favorite_color(
    +  name STRING, 
    +  favorite_color STRING,
    +  favorite_NUMBERS array<integer>
    +) USING csv PARTITIONED BY(favorite_color);
    +
    +{% endhighlight %}
    +
    +</div>
    +
    +</div>
    +
    +It is possible to use both partitions and buckets for a single table:
    +
    +<div class="codetabs">
    +
    +<div data-lang="scala"  markdown="1">
    +{% include_example write_partition_and_bucket scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %}
    +</div>
    +
    +<div data-lang="java"  markdown="1">
    +{% include_example write_partition_and_bucket java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %}
    +</div>
    +
    +<div data-lang="python"  markdown="1">
    +{% include_example write_partition_and_bucket python/sql/datasource.py %}
    +</div>
    +
    +<div data-lang="sql"  markdown="1">
    +
    +{% highlight sql %}
    +
    +CREATE TABLE users_bucketed_and_partitioned(
    +  name STRING,
    +  favorite_color STRING,
    +  favorite_NUMBERS array<integer>
    +) USING parquet 
    +PARTITIONED BY (favorite_color)
    +CLUSTERED BY(name) INTO 42 BUCKETS;
    +
    +{% endhighlight %}
    +
    +</div>
    +
    +</div>
    +
    +`partitionBy` creates a directory structure as described in the [Partition Discovery](#partition-discovery) section.
    +Because of that it has limited applicability to columns with high cardinality. In contrast `bucketBy` distributes
    --- End diff --
    
    `In contrast `bucketBy` distributes` -> `In contrast, `bucketBy` distributes`
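
The contrast drawn in the quoted paragraph (one directory per distinct value for `partitionBy`, a fixed number of buckets for `bucketBy`) can be sketched with a toy layout model. This is pure Python under loud assumptions: `output_path` is a hypothetical helper, `zlib.crc32` is only a stand-in for whatever hash Spark uses, and the `bucket_NNNNN` name is invented for illustration and is not Spark's file naming.

```python
import zlib


def output_path(row, partition_cols=(), bucket_cols=(), num_buckets=None):
    """Return a hypothetical relative location for the data holding `row`."""
    # Partitioning: one `column=value` directory level per partition column.
    parts = ["%s=%s" % (col, row[col]) for col in partition_cols]

    # Bucketing: hash the bucket columns into a fixed number of buckets.
    if bucket_cols:
        key = "|".join(str(row[col]) for col in bucket_cols).encode("utf-8")
        bucket = zlib.crc32(key) % num_buckets  # stand-in hash, an assumption
        parts.append("bucket_%05d" % bucket)
    return "/".join(parts)


# 100 rows with unique names and only three distinct colours.
colors = ["red", "blue", "green"]
rows = [{"name": "user%d" % i, "favorite_color": colors[i % 3]} for i in range(100)]

# Partitioning by a high-cardinality column: one directory per distinct name.
dirs_by_name = {output_path(r, partition_cols=["name"]) for r in rows}

# Bucketing by the same column: never more than `num_buckets` locations.
buckets_by_name = {output_path(r, bucket_cols=["name"], num_buckets=42) for r in rows}

# Both together: a low-cardinality partition column, bucketed by name inside it.
combined = {
    output_path(r, partition_cols=["favorite_color"], bucket_cols=["name"], num_buckets=42)
    for r in rows
}
```

In this model the number of partition directories grows with the number of distinct values, while the number of buckets is bounded by the chosen bucket count regardless of cardinality.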



[GitHub] spark pull request #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17938#discussion_r115923280
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -1766,12 +1806,6 @@ Spark SQL supports the vast majority of Hive features, such as:
     Below is a list of Hive features that we don't support yet. Most of these features are rarely used
     in Hive deployments.
     
    -**Major Hive Features**
    -
    -* Tables with buckets: bucket is the hash partitioning within a Hive table partition. Spark SQL
    -  doesn't support buckets yet.
    --- End diff --
    
    +1



[GitHub] spark issue #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter partit...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/17938
  
    Will merge it when my minor comment is resolved.
    
    Thanks for working on it! 



[GitHub] spark issue #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter partit...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17938
  
    **[Test build #76899 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76899/testReport)** for PR 17938 at commit [`b5babf6`](https://github.com/apache/spark/commit/b5babf65571661ca45880cd80a950959f66523a1).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark issue #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter partit...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17938
  
    Merged build finished. Test FAILed.



[GitHub] spark pull request #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17938#discussion_r116371615
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -581,6 +581,113 @@ Starting from Spark 2.1, persistent datasource tables have per-partition metadat
     
     Note that partition information is not gathered by default when creating external datasource tables (those with a `path` option). To sync the partition information in the metastore, you can invoke `MSCK REPAIR TABLE`.
     
    +### Bucketing, Sorting and Partitioning
    +
    +For file-based data source it is also possible to bucket and sort or partition the output. 
    +Bucketing and sorting is applicable only to persistent tables:
    +
    +<div class="codetabs">
    +
    +<div data-lang="scala"  markdown="1">
    +{% include_example write_sorting_and_bucketing scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %}
    +</div>
    +
    +<div data-lang="java"  markdown="1">
    +{% include_example write_sorting_and_bucketing java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %}
    +</div>
    +
    +<div data-lang="python"  markdown="1">
    +{% include_example write_sorting_and_bucketing python/sql/datasource.py %}
    +</div>
    +
    +<div data-lang="sql"  markdown="1">
    +
    +{% highlight sql %}
    +
    +CREATE TABLE users_bucketed_by_name(
    +  name STRING,
    +  favorite_color STRING,
    +  favorite_NUMBERS array<integer>
    +) USING parquet 
    +CLUSTERED BY(name) INTO 42 BUCKETS;
    --- End diff --
    
    Could you please use the same table names `people_bucketed` with the same column names in the example? Thanks!



[GitHub] spark issue #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter partit...

Posted by zero323 <gi...@git.apache.org>.

Github user zero323 commented on the issue:

    https://github.com/apache/spark/pull/17938
  
    Thanks @gatorsmile 



[GitHub] spark issue #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter partit...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17938
  
    **[Test build #76905 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76905/testReport)** for PR 17938 at commit [`65ac310`](https://github.com/apache/spark/commit/65ac310787927e4180b93863e361d87265c16ce5).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark issue #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter partit...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17938
  
    **[Test build #76867 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76867/testReport)** for PR 17938 at commit [`c4d7856`](https://github.com/apache/spark/commit/c4d7856c82aab845cf9cef4460302461db7e1384).



[GitHub] spark issue #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter partit...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17938
  
    **[Test build #76911 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76911/testReport)** for PR 17938 at commit [`3a8b6e9`](https://github.com/apache/spark/commit/3a8b6e94dd40372704aa4e1cdce015bcc1c3b893).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark issue #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter partit...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17938
  
    **[Test build #76825 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76825/testReport)** for PR 17938 at commit [`7bf4bbc`](https://github.com/apache/spark/commit/7bf4bbc30a6fa821d85285519c035be0a4f66b0c).
     * This patch **fails Scala style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark issue #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter partit...

Posted by zero323 <gi...@git.apache.org>.

Github user zero323 commented on the issue:

    https://github.com/apache/spark/pull/17938
  
    @gatorsmile Huh... in that case it looks like the parser (?) needs a little bit of work, unless of course the following are features.
    
    - Omitting `USING` doesn't work 
    
      ```sql
      CREATE TABLE user_info_bucketed(user_id BIGINT, firstname STRING, lastname STRING)
      CLUSTERED BY(user_id) INTO 256 BUCKETS
      ```
      with:
    
      ```
      Error in query: 
      Operation not allowed: CREATE TABLE ... CLUSTERED BY(line 1, pos 0)
      
      == SQL ==
      CREATE TABLE user_info_bucketed(user_id BIGINT, firstname STRING, lastname STRING)
      ^^^
      CLUSTERED BY(user_id) INTO 256 BUCKETS
      ```
    
    - Omitting `USING` and adding `PARTITIONED BY` with a column not present in the main clause (valid Hive DDL) doesn't work:
    
      ```sql
      CREATE TABLE user_info_bucketed(user_id BIGINT, firstname STRING, lastname STRING)
      PARTITIONED BY (department STRING)
      CLUSTERED BY(user_id) INTO 256 BUCKETS
      ```
      with
    
      ```
      Error in query: 
      Operation not allowed: CREATE TABLE ... CLUSTERED BY(line 1, pos 2)
      
      == SQL ==
        CREATE TABLE user_info_bucketed(user_id BIGINT, firstname STRING, lastname STRING)
      --^^^
        PARTITIONED BY (department STRING)
        CLUSTERED BY(user_id) INTO 256 BUCKETS
      ```
    
    - `PARTITIONED BY` alone works:
    
      ```sql
      CREATE TABLE user_info_bucketed(user_id BIGINT, firstname STRING, lastname STRING)
      PARTITIONED BY (department STRING)
      ```
    
    - `PARTITIONED BY` with `USING`, when the partition column is in the main spec, works:
    
        ```sql
        CREATE TABLE user_info_bucketed(
          user_id BIGINT, firstname STRING, lastname STRING, department STRING)
        USING parquet
        PARTITIONED BY (department)
        ```
    
    - `CLUSTERED BY` + `PARTITIONED BY` with `USING`, when the partition column is in the main spec, works:
    
        ```sql
        CREATE TABLE user_info_bucketed(
           user_id BIGINT, firstname STRING, lastname STRING, department STRING)
        USING parquet
        PARTITIONED BY (department)
        CLUSTERED BY(user_id) INTO 256 BUCKETS 
        ```
    - `PARTITIONED BY` when the partition column is in the main spec, with `USING` omitted, fails:
    
        ```sql
        CREATE TABLE user_info_bucketed(
         user_id BIGINT, firstname STRING, lastname STRING, department STRING)
        PARTITIONED BY (department)
        ```
     
        with:
    
        ```
        Error in query: 
        mismatched input ')' expecting {'SELECT', 'FROM', 'ADD', 'AS', 'ALL', 'DISTINCT', 'WHERE', 'GROUP', 'BY', 'GROUPING', 'SETS', 'CUBE', 'ROLLUP', 'ORDER', 'HAVING', 'LIMIT', 'AT', 'OR', 'AND', 'IN', NOT, 'NO', 'EXISTS', 'BETWEEN', 'LIKE', RLIKE, 'IS', 'NULL', 'TRUE', 'FALSE', 'NULLS', 'ASC', 'DESC', 'FOR', 'INTERVAL', 'CASE', 'WHEN', 'THEN', 'ELSE', 'END', 'JOIN', 'CROSS', 'OUTER', 'INNER', 'LEFT', 'SEMI', 'RIGHT', 'FULL', 'NATURAL', 'ON', 'LATERAL', 'WINDOW', 'OVER', 'PARTITION', 'RANGE', 'ROWS', 'UNBOUNDED', 'PRECEDING', 'FOLLOWING', 'CURRENT', 'FIRST', 'AFTER', 'LAST', 'ROW', 'WITH', 'VALUES', 'CREATE', 'TABLE', 'VIEW', 'REPLACE', 'INSERT', 'DELETE', 'INTO', 'DESCRIBE', 'EXPLAIN', 'FORMAT', 'LOGICAL', 'CODEGEN', 'COST', 'CAST', 'SHOW', 'TABLES', 'COLUMNS', 'COLUMN', 'USE', 'PARTITIONS', 'FUNCTIONS', 'DROP', 'UNION', 'EXCEPT', 'MINUS', 'INTERSECT', 'TO', 'TABLESAMPLE', 'STRATIFY', 'ALTER', 'RENAME', 'ARRAY', 'MAP', 'STRUCT', 'COMMENT', 'SET', 'RESET', 'DATA', 'START', 'TRANSACTION', 'COMMIT', 'ROLLBACK', 'MACRO', 'IGNORE', 'IF', 'DIV', 'PERCENT', 'BUCKET', 'OUT', 'OF', 'SORT', 'CLUSTER', 'DISTRIBUTE', 'OVERWRITE', 'TRANSFORM', 'REDUCE', 'USING', 'SERDE', 'SERDEPROPERTIES', 'RECORDREADER', 'RECORDWRITER', 'DELIMITED', 'FIELDS', 'TERMINATED', 'COLLECTION', 'ITEMS', 'KEYS', 'ESCAPED', 'LINES', 'SEPARATED', 'FUNCTION', 'EXTENDED', 'REFRESH', 'CLEAR', 'CACHE', 'UNCACHE', 'LAZY', 'FORMATTED', 'GLOBAL', TEMPORARY, 'OPTIONS', 'UNSET', 'TBLPROPERTIES', 'DBPROPERTIES', 'BUCKETS', 'SKEWED', 'STORED', 'DIRECTORIES', 'LOCATION', 'EXCHANGE', 'ARCHIVE', 'UNARCHIVE', 'FILEFORMAT', 'TOUCH', 'COMPACT', 'CONCATENATE', 'CHANGE', 'CASCADE', 'RESTRICT', 'CLUSTERED', 'SORTED', 'PURGE', 'INPUTFORMAT', 'OUTPUTFORMAT', DATABASE, DATABASES, 'DFS', 'TRUNCATE', 'ANALYZE', 'COMPUTE', 'LIST', 'STATISTICS', 'PARTITIONED', 'EXTERNAL', 'DEFINED', 'REVOKE', 'GRANT', 'LOCK', 'UNLOCK', 'MSCK', 'REPAIR', 'RECOVER', 'EXPORT', 'IMPORT', 'LOAD', 'ROLE', 'ROLES', 'COMPACTIONS', 'PRINCIPALS', 'TRANSACTIONS', 'INDEX', 'INDEXES', 'LOCKS', 'OPTION', 'ANTI', 'LOCAL', 'INPATH', 'CURRENT_DATE', 'CURRENT_TIMESTAMP', IDENTIFIER, BACKQUOTED_IDENTIFIER}(line 3, pos 30)
        
        == SQL ==
            CREATE TABLE user_info_bucketed(
              user_id BIGINT, firstname STRING, lastname STRING, department STRING)
            PARTITIONED BY (department)
        ------------------------------^^^
        ```



[GitHub] spark pull request #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17938#discussion_r116371727
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -581,6 +581,113 @@ Starting from Spark 2.1, persistent datasource tables have per-partition metadat
     
     Note that partition information is not gathered by default when creating external datasource tables (those with a `path` option). To sync the partition information in the metastore, you can invoke `MSCK REPAIR TABLE`.
     
    +### Bucketing, Sorting and Partitioning
    +
    +For file-based data source it is also possible to bucket and sort or partition the output. 
    +Bucketing and sorting is applicable only to persistent tables:
    +
    +<div class="codetabs">
    +
    +<div data-lang="scala"  markdown="1">
    +{% include_example write_sorting_and_bucketing scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %}
    +</div>
    +
    +<div data-lang="java"  markdown="1">
    +{% include_example write_sorting_and_bucketing java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %}
    +</div>
    +
    +<div data-lang="python"  markdown="1">
    +{% include_example write_sorting_and_bucketing python/sql/datasource.py %}
    +</div>
    +
    +<div data-lang="sql"  markdown="1">
    +
    +{% highlight sql %}
    +
    +CREATE TABLE users_bucketed_by_name(
    +  name STRING,
    +  favorite_color STRING,
    +  favorite_NUMBERS array<integer>
    +) USING parquet 
    +CLUSTERED BY(name) INTO 42 BUCKETS;
    +
    +{% endhighlight %}
    +
    +</div>
    +
    +</div>
    +
    +while partitioning can be used with both `save` and `saveAsTable`:
    +
    +
    +<div class="codetabs">
    +
    +<div data-lang="scala"  markdown="1">
    +{% include_example write_partitioning scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %}
    +</div>
    +
    +<div data-lang="java"  markdown="1">
    +{% include_example write_partitioning java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %}
    +</div>
    +
    +<div data-lang="python"  markdown="1">
    +{% include_example write_partitioning python/sql/datasource.py %}
    +</div>
    +
    +<div data-lang="sql"  markdown="1">
    +
    +{% highlight sql %}
    +
    +CREATE TABLE users_by_favorite_color(
    +  name STRING, 
    +  favorite_color STRING,
    +  favorite_NUMBERS array<integer>
    +) USING csv PARTITIONED BY(favorite_color);
    +
    +{% endhighlight %}
    +
    +</div>
    +
    +</div>
    +
    +It is possible to use both partitions and buckets for a single table:
    +
    +<div class="codetabs">
    +
    +<div data-lang="scala"  markdown="1">
    +{% include_example write_partition_and_bucket scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %}
    +</div>
    +
    +<div data-lang="java"  markdown="1">
    +{% include_example write_partition_and_bucket java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %}
    +</div>
    +
    +<div data-lang="python"  markdown="1">
    +{% include_example write_partition_and_bucket python/sql/datasource.py %}
    +</div>
    +
    +<div data-lang="sql"  markdown="1">
    +
    +{% highlight sql %}
    +
    +CREATE TABLE users_bucketed_and_partitioned(
    +  name STRING,
    +  favorite_color STRING,
    +  favorite_NUMBERS array<integer>
    +) USING parquet 
    +PARTITIONED BY (favorite_color)
    +CLUSTERED BY(name) INTO 42 BUCKETS;
    +
    +{% endhighlight %}
    +
    +</div>
    +
    +</div>
    +
    +`partitionBy` creates a directory structure as described in the [Partition Discovery](#partition-discovery) section.
    +Because of that it has limited applicability to columns with high cardinality. In contrast `bucketBy` distributes
    --- End diff --
    
    `Because of that it has` -> `Thus, it has`
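
The cardinality point made in the quoted guide text can be illustrated with a small, Spark-free sketch. It assumes a directory-per-value layout (`column=value`) for partitioning and a fixed number of hash buckets for bucketing. The helper names `partition_dir` and `bucket_id`, the sample rows, and the CRC32 hash are all hypothetical stand-ins for illustration; Spark's own bucketing hash is different.

```python
import zlib


def partition_dir(row, column):
    """Directory-per-value layout: one directory for each distinct value."""
    return f"{column}={row[column]}"


def bucket_id(row, column, num_buckets):
    """Assign a row to one of num_buckets buckets.

    CRC32 is only a deterministic stand-in here, not Spark's actual hash.
    """
    return zlib.crc32(str(row[column]).encode("utf-8")) % num_buckets


# Hypothetical data: 300 rows, unique names, only 3 distinct colors.
colors = ["red", "green", "blue"]
rows = [{"name": f"user{i}", "favorite_color": colors[i % 3]} for i in range(300)]

# Partitioning by a low-cardinality column yields a handful of directories.
color_dirs = {partition_dir(r, "favorite_color") for r in rows}

# Partitioning by a high-cardinality column yields one directory per value.
name_dirs = {partition_dir(r, "name") for r in rows}

# Bucketing by the same high-cardinality column is capped at num_buckets.
name_buckets = {bucket_id(r, "name", 42) for r in rows}

print(len(color_dirs), len(name_dirs), len(name_buckets))
```

Under these assumptions, the number of partition directories grows with the number of distinct values, while the number of buckets never exceeds the count passed to `bucket_id`, which is why bucketing suits columns such as `name`.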



[GitHub] spark issue #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter partit...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17938
  
    **[Test build #77436 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77436/testReport)** for PR 17938 at commit [`bea0676`](https://github.com/apache/spark/commit/bea0676088dadbc5af544f581aa8a2ed49355acc).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17938#discussion_r116371632
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -581,6 +581,113 @@ Starting from Spark 2.1, persistent datasource tables have per-partition metadat
     
     Note that partition information is not gathered by default when creating external datasource tables (those with a `path` option). To sync the partition information in the metastore, you can invoke `MSCK REPAIR TABLE`.
     
    +### Bucketing, Sorting and Partitioning
    +
    +For file-based data source it is also possible to bucket and sort or partition the output. 
    +Bucketing and sorting is applicable only to persistent tables:
    --- End diff --
    
    `is applicable` -> `are applicable`



[GitHub] spark issue #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter partit...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17938
  
    **[Test build #76905 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76905/testReport)** for PR 17938 at commit [`65ac310`](https://github.com/apache/spark/commit/65ac310787927e4180b93863e361d87265c16ce5).



[GitHub] spark issue #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter partit...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17938
  
    **[Test build #76830 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76830/testReport)** for PR 17938 at commit [`606f1e3`](https://github.com/apache/spark/commit/606f1e3a5f672d8f7a7dc98fe041e347e65a2d03).



[GitHub] spark issue #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter partit...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17938
  
    **[Test build #76900 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76900/testReport)** for PR 17938 at commit [`92fb3b3`](https://github.com/apache/spark/commit/92fb3b3e00a666ff3bd1eca4e5dee0cefcca2d55).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark issue #17938: [DOCS][SQL] Document bucketing and partitioning in SQL g...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/17938
  
    @zero323, what do you think about opening a JIRA or turning this into a followup for your previous PR? I know it is a doc fix, but it sounds like a pretty important and non-trivial one.



[GitHub] spark pull request #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17938#discussion_r116371627
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -581,6 +581,113 @@ Starting from Spark 2.1, persistent datasource tables have per-partition metadat
     
     Note that partition information is not gathered by default when creating external datasource tables (those with a `path` option). To sync the partition information in the metastore, you can invoke `MSCK REPAIR TABLE`.
     
    +### Bucketing, Sorting and Partitioning
    +
    +For file-based data source it is also possible to bucket and sort or partition the output. 
    --- End diff --
    
    Nit, `For file-based data source it` -> `For file-based data source, it`

