Posted to reviews@spark.apache.org by zero323 <gi...@git.apache.org> on 2017/02/27 05:17:51 UTC

[GitHub] spark pull request #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for...

GitHub user zero323 opened a pull request:

    https://github.com/apache/spark/pull/17077

    [SPARK-16931][PYTHON][SQL] Add Python wrapper for bucketBy 

    ## What changes were proposed in this pull request?
    
    Adds Python wrappers for `DataFrameWriter.bucketBy` and `DataFrameWriter.sortBy` ([SPARK-16931](https://issues.apache.org/jira/browse/SPARK-16931))
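
    For illustration, the intended usage reads like this minimal sketch (based on the docstring example quoted later in this thread; `df` is assumed to have `year`, `month`, and `day` columns):
    
    ```python
    # Bucket the output by (year, month) into 100 buckets and sort each
    # bucket by day; bucketing is applicable together with saveAsTable.
    (df.write.format("parquet")
        .bucketBy(100, "year", "month")
        .sortBy("day")
        .mode("overwrite")
        .saveAsTable("bucketed_table"))
    ```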
    
    ## How was this patch tested?
    
    Unit tests covering the new feature.
    
    __Note__: Based on the work of @GregBowyer (f49b9a23468f7af32cb53d2b654272757c151725)
    
    CC @HyukjinKwon 

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/zero323/spark SPARK-16931

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/17077.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #17077
    
----
commit 3024d1cca60793f9aca2cf144bc630106374131c
Author: Greg Bowyer <gb...@fastmail.co.uk>
Date:   2016-08-06T00:53:30Z

    [SPARK-16931][PYTHON] PySpark APIs for bucketBy and sortBy

commit 7d911c647f21ada7fb429fd7c1c5f15934ff8847
Author: zero323 <ze...@users.noreply.github.com>
Date:   2017-02-27T02:59:22Z

    Add tests for bucketed writes

commit 72c04a3f196da5223ebb44725aa88cffa81036e4
Author: zero323 <ze...@users.noreply.github.com>
Date:   2017-02-27T02:59:52Z

    Check input types in sortBy / bucketBy

----



[GitHub] spark issue #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for bucket...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17077
  
    **[Test build #75667 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75667/testReport)** for PR 17077 at commit [`71c9e0f`](https://github.com/apache/spark/commit/71c9e0faf39b979eb7f61d74af8c1821d0a0bcf3).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark issue #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for bucket...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17077
  
    Merged build finished. Test FAILed.



[GitHub] spark issue #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for bucket...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17077
  
    Merged build finished. Test FAILed.



[GitHub] spark issue #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for bucket...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17077
  
    **[Test build #75666 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75666/testReport)** for PR 17077 at commit [`481416d`](https://github.com/apache/spark/commit/481416d695d804144d041a98ea929b88829ebe47).
     * This patch **fails Python style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17077#discussion_r110538647
  
    --- Diff: python/pyspark/sql/readwriter.py ---
    @@ -545,6 +545,57 @@ def partitionBy(self, *cols):
             self._jwrite = self._jwrite.partitionBy(_to_seq(self._spark._sc, cols))
             return self
     
    +    @since(2.2)
    +    def bucketBy(self, numBuckets, *cols):
    +        """Buckets the output by the given columns on the file system.
    --- End diff --
    
    So the `bucketBy` description in the Scaladoc is a bit more in-depth; you might just want to copy that.
    
    Also, our general style for multi-line docstrings is to have the opening `"""` on its own line.
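
    For illustration, a minimal sketch of that convention applied to the method in this diff:
    
    ```python
    def bucketBy(self, numBuckets, *cols):
        """
        Buckets the output by the given columns on the file system.
        """
    ```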



[GitHub] spark pull request #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17077#discussion_r115129682
  
    --- Diff: python/pyspark/sql/readwriter.py ---
    @@ -563,6 +563,60 @@ def partitionBy(self, *cols):
             self._jwrite = self._jwrite.partitionBy(_to_seq(self._spark._sc, cols))
             return self
     
    +    @since(2.3)
    +    def bucketBy(self, numBuckets, *cols):
    +        """Buckets the output by the given columns on the file system.
    --- End diff --
    
    I'd copy the full description from DataFrameWriter here since comparing it to Hive could help people new to Spark understand what bucketBy does.
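
    For reference, the fuller wording that a later diff in this thread adopts reads:
    
    ```python
    def bucketBy(self, numBuckets, col, *cols):
        """Buckets the output by the given columns. If specified,
        the output is laid out on the file system similar to Hive's
        bucketing scheme.
        """
    ```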



[GitHub] spark pull request #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17077#discussion_r115153302
  
    --- Diff: python/pyspark/sql/readwriter.py ---
    @@ -563,6 +563,63 @@ def partitionBy(self, *cols):
             self._jwrite = self._jwrite.partitionBy(_to_seq(self._spark._sc, cols))
             return self
     
    +    @since(2.3)
    +    def bucketBy(self, numBuckets, col, *cols):
    +        """Buckets the output by the given columns.If specified,
    +        the output is laid out on the file system similar to Hive's bucketing scheme.
    +
    +        :param numBuckets: the number of buckets to save
    +        :param col: a name of a column, or a list of names.
    +        :param cols: additional names (optional). If `col` is a list it should be empty.
    +
    +        .. note:: Applicable for file-based data sources in combination with
    +                  :py:meth:`DataFrameWriter.saveAsTable`.
    --- End diff --
    
    uh. Yes. Bucket info is not part of the file/directory names, unlike partitioning info. 
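
    A minimal sketch of the contrast (paths and the table name are hypothetical):
    
    ```python
    # Partitioning shows up in the directory layout:
    df.write.partitionBy("y").parquet("/tmp/partitioned")  # /tmp/partitioned/y=foo/...
    
    # Bucketing info is kept in the table metadata instead, hence saveAsTable:
    df.write.bucketBy(3, "x").mode("overwrite").saveAsTable("bucketed_tbl")
    ```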



[GitHub] spark issue #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for bucket...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on the issue:

    https://github.com/apache/spark/pull/17077
  
    One minor comment, but otherwise this is looking in very good shape.



[GitHub] spark issue #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for bucket...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17077
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73523/
    Test FAILed.



[GitHub] spark issue #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for bucket...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17077
  
    **[Test build #73523 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73523/testReport)** for PR 17077 at commit [`9fde39f`](https://github.com/apache/spark/commit/9fde39fa2174e9e67d6045b890f8cc0fc76cd61b).
     * This patch **fails PySpark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark issue #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for bucket...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17077
  
    Merged build finished. Test FAILed.



[GitHub] spark issue #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for bucket...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17077
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73504/
    Test FAILed.



[GitHub] spark issue #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for bucket...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on the issue:

    https://github.com/apache/spark/pull/17077
  
    This looks like an important improvement that might make sense to try to get in for 2.2, so I'll try to get some reviewing in.



[GitHub] spark issue #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for bucket...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17077
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/76545/
    Test PASSed.



[GitHub] spark issue #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for bucket...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17077
  
    **[Test build #76545 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76545/testReport)** for PR 17077 at commit [`c996828`](https://github.com/apache/spark/commit/c99682829242e4a993a685cdacc28cffe2434292).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17077#discussion_r110542557
  
    --- Diff: python/pyspark/sql/readwriter.py ---
    @@ -545,6 +545,57 @@ def partitionBy(self, *cols):
             self._jwrite = self._jwrite.partitionBy(_to_seq(self._spark._sc, cols))
             return self
     
    +    @since(2.2)
    +    def bucketBy(self, numBuckets, *cols):
    +        """Buckets the output by the given columns on the file system.
    --- End diff --
    
    Both
    
    ```
    """
    ...
    """
    ```
    
    or
    
    ```
    """...
    """
    ```
    
    comply with PEP 8 for multi-line docstrings, to my knowledge, although I don't think a specific style has been preferred in this case.



[GitHub] spark pull request #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17077#discussion_r110794303
  
    --- Diff: python/pyspark/sql/tests.py ---
    @@ -2167,6 +2167,61 @@ def test_BinaryType_serialization(self):
             df = self.spark.createDataFrame(data, schema=schema)
             df.collect()
     
    +    def test_bucketed_write(self):
    +        data = [
    +            (1, "foo", 3.0), (2, "foo", 5.0),
    +            (3, "bar", -1.0), (4, "bar", 6.0),
    +        ]
    +        df = self.spark.createDataFrame(data, ["x", "y", "z"])
    +
    +        # Test write with one bucketing column
    +        df.write.bucketBy(3, "x").mode("overwrite").saveAsTable("pyspark_bucket")
    +        self.assertEqual(
    +            len([c for c in self.spark.catalog.listColumns("pyspark_bucket")
    +                 if c.name == "x" and c.isBucket]),
    +            1
    +        )
    +        self.assertSetEqual(set(data), set(self.spark.table("pyspark_bucket").collect()))
    +
    +        # Test write two bucketing columns
    +        df.write.bucketBy(3, "x", "y").mode("overwrite").saveAsTable("pyspark_bucket")
    +        self.assertEqual(
    +            len([c for c in self.spark.catalog.listColumns("pyspark_bucket")
    +                 if c.name in ("x", "y") and c.isBucket]),
    --- End diff --
    
    Thank you for taking my opinion into account. Yeah, we should remove or change the version; I meant to follow the rest of the contents.
    
    Generally, documentation content has been kept consistent among APIs in different languages, to my knowledge. I don't think this is a must, but it is safer to avoid confusion for users and criticism down the road.
    
    I have seen several minor PRs fixing documentation (e.g., typos) that then had to be fixed identically for the same APIs in other languages, and I have also made some PRs to match the documentation, e.g., https://github.com/apache/spark/pull/17429



[GitHub] spark pull request #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17077#discussion_r103138278
  
    --- Diff: python/pyspark/sql/tests.py ---
    @@ -2038,6 +2038,53 @@ def test_BinaryType_serialization(self):
             df = self.spark.createDataFrame(data, schema=schema)
             df.collect()
     
    +    def test_bucketed_write(self):
    +        data = [
    +            (1, "foo", 3.0), (2, "foo", 5.0),
    +            (3, "bar", -1.0), (4, "bar", 6.0),
    +        ]
    +        df = self.spark.createDataFrame(data, ["x", "y", "z"])
    +
    +        # Test write with one bucketing column
    +        df.write.bucketBy(3, "x").mode("overwrite").saveAsTable("pyspark_bucket")
    +        self.assertEqual(
    +            len([c for c in self.spark.catalog.listColumns("pyspark_bucket") if c.name == "x" and c.isBucket]),
    --- End diff --
    
    Oh, BTW, I assume this exceeds the 100-character line-length limit?



[GitHub] spark pull request #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for...

Posted by zero323 <gi...@git.apache.org>.
Github user zero323 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17077#discussion_r110542103
  
    --- Diff: python/pyspark/sql/readwriter.py ---
    @@ -545,6 +545,57 @@ def partitionBy(self, *cols):
             self._jwrite = self._jwrite.partitionBy(_to_seq(self._spark._sc, cols))
             return self
     
    +    @since(2.2)
    +    def bucketBy(self, numBuckets, *cols):
    +        """Buckets the output by the given columns on the file system.
    --- End diff --
    
    Regarding style, I had a similar exchange with @jkbradley lately (https://github.com/apache/spark/pull/17218#pullrequestreview-29059063). If a single convention is desired, I believe it should be documented and the remaining docstrings adjusted. Personally I am indifferent, though PEP 8 and PEP 257 seem to prefer this convention over placing the opening quotes on a separate line.
    
    >  you might just want to copy that.
    
    Do you mean [this](https://github.com/apache/spark/blob/364b0db75308ddd346b4ab1e032680e8eb4c1753/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala#L183-L186)? I wonder if we should rather document that it is allowed only with `saveAsTable`. What do you think?



[GitHub] spark pull request #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17077#discussion_r110634385
  
    --- Diff: python/pyspark/sql/tests.py ---
    @@ -2167,6 +2167,61 @@ def test_BinaryType_serialization(self):
             df = self.spark.createDataFrame(data, schema=schema)
             df.collect()
     
    +    def test_bucketed_write(self):
    +        data = [
    +            (1, "foo", 3.0), (2, "foo", 5.0),
    +            (3, "bar", -1.0), (4, "bar", 6.0),
    +        ]
    +        df = self.spark.createDataFrame(data, ["x", "y", "z"])
    +
    +        # Test write with one bucketing column
    +        df.write.bucketBy(3, "x").mode("overwrite").saveAsTable("pyspark_bucket")
    +        self.assertEqual(
    +            len([c for c in self.spark.catalog.listColumns("pyspark_bucket")
    +                 if c.name == "x" and c.isBucket]),
    +            1
    +        )
    +        self.assertSetEqual(set(data), set(self.spark.table("pyspark_bucket").collect()))
    +
    +        # Test write two bucketing columns
    +        df.write.bucketBy(3, "x", "y").mode("overwrite").saveAsTable("pyspark_bucket")
    +        self.assertEqual(
    +            len([c for c in self.spark.catalog.listColumns("pyspark_bucket")
    +                 if c.name in ("x", "y") and c.isBucket]),
    +            2
    +        )
    +        self.assertSetEqual(set(data), set(self.spark.table("pyspark_bucket").collect()))
    +
    +        # Test write with bucket and sort
    +        df.write.bucketBy(2, "x").sortBy("z").mode("overwrite").saveAsTable("pyspark_bucket")
    +        self.assertEqual(
    +            len([c for c in self.spark.catalog.listColumns("pyspark_bucket")
    +                 if c.name == "x" and c.isBucket]),
    +            1
    +        )
    +        self.assertSetEqual(set(data), set(self.spark.table("pyspark_bucket").collect()))
    +
    +        # Test write with a list of columns
    +        df.write.bucketBy(3, ["x", "y"]).mode("overwrite").saveAsTable("pyspark_bucket")
    +        self.assertEqual(
    +            len([c for c in self.spark.catalog.listColumns("pyspark_bucket")
    +                 if c.name in ("x", "y") and c.isBucket]),
    +            2
    +        )
    +        self.assertSetEqual(set(data), set(self.spark.table("pyspark_bucket").collect()))
    +
    +        # Test write with bucket and sort with a list of columns
    +        (df.write.bucketBy(2, "x")
    +            .sortBy(["y", "z"])
    +            .mode("overwrite").saveAsTable("pyspark_bucket"))
    +        self.assertSetEqual(set(data), set(self.spark.table("pyspark_bucket").collect()))
    +
    +        # Test write with bucket and sort with multiple columns
    +        (df.write.bucketBy(2, "x")
    +            .sortBy("y", "z")
    +            .mode("overwrite").saveAsTable("pyspark_bucket"))
    --- End diff --
    
    @zero323, should we drop the table before or after this test?



[GitHub] spark issue #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for bucket...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17077
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73535/
    Test PASSed.



[GitHub] spark pull request #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17077#discussion_r110538670
  
    --- Diff: python/pyspark/sql/readwriter.py ---
    @@ -545,6 +545,57 @@ def partitionBy(self, *cols):
             self._jwrite = self._jwrite.partitionBy(_to_seq(self._spark._sc, cols))
             return self
     
    +    @since(2.2)
    +    def bucketBy(self, numBuckets, *cols):
    +        """Buckets the output by the given columns on the file system.
    +
    +        :param numBuckets: the number of buckets to save
    +        :param cols: name of columns
    +
    +        >>> (df.write.format('parquet')
    +        ...     .bucketBy(100, 'year', 'month')
    +        ...     .mode("overwrite")
    +        ...     .saveAsTable('bucketed_table'))
    +        """
    +        if len(cols) == 1 and isinstance(cols[0], (list, tuple)):
    +            cols = cols[0]
    +
    +        if not isinstance(numBuckets, int):
    +            raise TypeError("numBuckets should be an int, got {0}.".format(type(numBuckets)))
    +
    +        if not all(isinstance(c, basestring) for c in cols):
    +            raise TypeError("cols argument should be a string or a sequence of strings.")
    +
    +        col = cols[0]
    +        cols = cols[1:]
    +
    +        self._jwrite = self._jwrite.bucketBy(numBuckets, col, _to_seq(self._spark._sc, cols))
    +        return self
    +
    +    @since(2.2)
    +    def sortBy(self, *cols):
    +        """Sorts the output in each bucket by the given columns on the file system.
    --- End diff --
    
    Same comment as above with regards to the docstring
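
    For reference, a `sortBy` wrapper mirroring the quoted `bucketBy` might look like the sketch below (not necessarily the merged implementation; the validation simply follows the `bucketBy` snippet above):
    
    ```python
    @since(2.2)
    def sortBy(self, *cols):
        """Sorts the output in each bucket by the given columns on the file system.
    
        :param cols: name of columns
        """
        # Accept either varargs or a single list/tuple of column names.
        if len(cols) == 1 and isinstance(cols[0], (list, tuple)):
            cols = cols[0]
    
        if not all(isinstance(c, basestring) for c in cols):
            raise TypeError("cols argument should be a string or a sequence of strings.")
    
        # The JVM API takes a head column plus a Seq of the remaining names.
        col = cols[0]
        cols = cols[1:]
    
        self._jwrite = self._jwrite.sortBy(col, _to_seq(self._spark._sc, cols))
        return self
    ```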



[GitHub] spark issue #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for bucket...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/17077
  
    The SQL document update can be a separate PR. Thanks for your work!



[GitHub] spark issue #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for bucket...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17077
  
    **[Test build #73523 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73523/testReport)** for PR 17077 at commit [`9fde39f`](https://github.com/apache/spark/commit/9fde39fa2174e9e67d6045b890f8cc0fc76cd61b).



[GitHub] spark pull request #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17077#discussion_r110544338
  
    --- Diff: python/pyspark/sql/tests.py ---
    @@ -2038,6 +2038,61 @@ def test_BinaryType_serialization(self):
             df = self.spark.createDataFrame(data, schema=schema)
             df.collect()
     
    +    def test_bucketed_write(self):
    +        data = [
    +            (1, "foo", 3.0), (2, "foo", 5.0),
    +            (3, "bar", -1.0), (4, "bar", 6.0),
    +        ]
    +        df = self.spark.createDataFrame(data, ["x", "y", "z"])
    +
    +        # Test write with one bucketing column
    +        df.write.bucketBy(3, "x").mode("overwrite").saveAsTable("pyspark_bucket")
    +        self.assertEqual(
    +            len([c for c in self.spark.catalog.listColumns("pyspark_bucket")
    +                 if c.name == "x" and c.isBucket]),
    --- End diff --
    
    BTW, maybe we should break this into multiple lines. It doesn't seem readable if more commits need to be pushed.



[GitHub] spark issue #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for bucket...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/17077
  
    LGTM
    
    `[SPARK-16931][PYTHON][SQL] Add Python wrapper for bucketBy` -> `[SPARK-16931][PYTHON][SQL] Add Python wrapper for bucketBy and sortBy`



[GitHub] spark pull request #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for...

Posted by zero323 <gi...@git.apache.org>.
Github user zero323 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17077#discussion_r103139360
  
    --- Diff: python/pyspark/sql/tests.py ---
    @@ -2038,6 +2038,53 @@ def test_BinaryType_serialization(self):
             df = self.spark.createDataFrame(data, schema=schema)
             df.collect()
     
    +    def test_bucketed_write(self):
    +        data = [
    +            (1, "foo", 3.0), (2, "foo", 5.0),
    +            (3, "bar", -1.0), (4, "bar", 6.0),
    +        ]
    +        df = self.spark.createDataFrame(data, ["x", "y", "z"])
    +
    +        # Test write with one bucketing column
    +        df.write.bucketBy(3, "x").mode("overwrite").saveAsTable("pyspark_bucket")
    +        self.assertEqual(
    +            len([c for c in self.spark.catalog.listColumns("pyspark_bucket") if c.name == "x" and c.isBucket]),
    --- End diff --
    
    Indeed.



[GitHub] spark issue #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for bucket...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/17077
  
    thanks, merging to master!



[GitHub] spark issue #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for bucket...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17077
  
    **[Test build #73522 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73522/testReport)** for PR 17077 at commit [`0ef84fb`](https://github.com/apache/spark/commit/0ef84fbb15e7cdfe0b2d8353ca315dce9b2fabfb).
     * This patch **fails PySpark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark issue #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for bucket...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17077
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75667/
    Test PASSed.



[GitHub] spark issue #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for bucket...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17077
  
    **[Test build #76545 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76545/testReport)** for PR 17077 at commit [`c996828`](https://github.com/apache/spark/commit/c99682829242e4a993a685cdacc28cffe2434292).



[GitHub] spark issue #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for bucket...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17077
  
    **[Test build #75666 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75666/testReport)** for PR 17077 at commit [`481416d`](https://github.com/apache/spark/commit/481416d695d804144d041a98ea929b88829ebe47).



[GitHub] spark issue #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for bucket...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17077
  
    **[Test build #75658 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75658/testReport)** for PR 17077 at commit [`7b93482`](https://github.com/apache/spark/commit/7b93482f31f2efb3d4d742eb3e385e6b4a2bc14e).
     * This patch **fails Python style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark issue #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for bucket...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17077
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73522/
    Test FAILed.



[GitHub] spark pull request #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for...

Posted by zero323 <gi...@git.apache.org>.
Github user zero323 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17077#discussion_r111221795
  
    --- Diff: python/pyspark/sql/tests.py ---
    @@ -2167,6 +2167,56 @@ def test_BinaryType_serialization(self):
             df = self.spark.createDataFrame(data, schema=schema)
             df.collect()
     
    +    def test_bucketed_write(self):
    +        data = [
    +            (1, "foo", 3.0), (2, "foo", 5.0),
    +            (3, "bar", -1.0), (4, "bar", 6.0),
    +        ]
    +        df = self.spark.createDataFrame(data, ["x", "y", "z"])
    +
    +        def count_bucketed_cols(names, table="pyspark_bucket"):
    +            """Given a sequence of column names and a table name
    +            query the catalog and return number o columns which are
    +            used for bucketing
    +            """
    +            cols = self.spark.catalog.listColumns(table)
    +            num = len([c for c in cols if c.name in names and c.isBucket])
    +            return num
    +
    +        # Test write with one bucketing column
    +        df.write.bucketBy(3, "x").mode("overwrite").saveAsTable("pyspark_bucket")
    +        self.assertEqual(count_bucketed_cols(["x"]), 1)
    +        self.assertSetEqual(set(data), set(self.spark.table("pyspark_bucket").collect()))
    +
    +        # Test write two bucketing columns
    +        df.write.bucketBy(3, "x", "y").mode("overwrite").saveAsTable("pyspark_bucket")
    +        self.assertEqual(count_bucketed_cols(["x", "y"]), 2)
    +        self.assertSetEqual(set(data), set(self.spark.table("pyspark_bucket").collect()))
    +
    +        # Test write with bucket and sort
    +        df.write.bucketBy(2, "x").sortBy("z").mode("overwrite").saveAsTable("pyspark_bucket")
    +        self.assertEqual(count_bucketed_cols(["x"]), 1)
    +        self.assertSetEqual(set(data), set(self.spark.table("pyspark_bucket").collect()))
    +
    +        # Test write with a list of columns
    +        df.write.bucketBy(3, ["x", "y"]).mode("overwrite").saveAsTable("pyspark_bucket")
    +        self.assertEqual(count_bucketed_cols(["x", "y"]), 2)
    +        self.assertSetEqual(set(data), set(self.spark.table("pyspark_bucket").collect()))
    +
    +        # Test write with bucket and sort with a list of columns
    +        (df.write.bucketBy(2, "x")
    +            .sortBy(["y", "z"])
    +            .mode("overwrite").saveAsTable("pyspark_bucket"))
    +        self.assertSetEqual(set(data), set(self.spark.table("pyspark_bucket").collect()))
    +
    +        # Test write with bucket and sort with multiple columns
    +        (df.write.bucketBy(2, "x")
    +            .sortBy("y", "z")
    +            .mode("overwrite").saveAsTable("pyspark_bucket"))
    +        self.assertSetEqual(set(data), set(self.spark.table("pyspark_bucket").collect()))
    +
    +        self.spark.sql("DROP TABLE IF EXISTS pyspark_bucket")
    --- End diff --
    
    @holdenk Do you suggest adding `tearDown`? I thought about it, but right now the tests are so inflated (sadly, not much support for [SPARK-19224](https://issues.apache.org/jira/browse/SPARK-19224)) that it would be completely detached from the context.
    
    On the other hand, adding an artificial `try ... finally` seems wrong.
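
    For illustration, a `tearDown` along those lines would be a minimal sketch like this (assuming a unittest-style test class as in `tests.py`):
    
    ```python
    def tearDown(self):
        # Drop the table that test_bucketed_write creates, if it exists.
        self.spark.sql("DROP TABLE IF EXISTS pyspark_bucket")
    ```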



[GitHub] spark issue #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for bucket...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17077
  
    Merged build finished. Test FAILed.



[GitHub] spark pull request #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17077#discussion_r103138165
  
    --- Diff: python/pyspark/sql/readwriter.py ---
    @@ -545,6 +545,55 @@ def partitionBy(self, *cols):
             self._jwrite = self._jwrite.partitionBy(_to_seq(self._spark._sc, cols))
             return self
     
    +    @since(2.1)
    --- End diff --
    
    Maybe it should be 2.2 :)



[GitHub] spark issue #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for bucket...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17077
  
    **[Test build #73527 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73527/testReport)** for PR 17077 at commit [`18c709c`](https://github.com/apache/spark/commit/18c709c4bf77fc6db5530e00a9e5bba0e1ab0250).



[GitHub] spark pull request #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17077#discussion_r115149598
  
    --- Diff: python/pyspark/sql/readwriter.py ---
    @@ -563,6 +563,63 @@ def partitionBy(self, *cols):
             self._jwrite = self._jwrite.partitionBy(_to_seq(self._spark._sc, cols))
             return self
     
    +    @since(2.3)
    +    def bucketBy(self, numBuckets, col, *cols):
    +        """Buckets the output by the given columns.If specified,
    +        the output is laid out on the file system similar to Hive's bucketing scheme.
    +
    +        :param numBuckets: the number of buckets to save
    +        :param col: a name of a column, or a list of names.
    +        :param cols: additional names (optional). If `col` is a list it should be empty.
    +
    +        .. note:: Applicable for file-based data sources in combination with
    +                  :py:meth:`DataFrameWriter.saveAsTable`.
    --- End diff --
    
    This is not accurate. We can also use `save` to store bucketed tables without saving their metadata in the metastore.



[GitHub] spark pull request #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17077#discussion_r115136451
  
    --- Diff: python/pyspark/sql/readwriter.py ---
    @@ -563,6 +563,60 @@ def partitionBy(self, *cols):
             self._jwrite = self._jwrite.partitionBy(_to_seq(self._spark._sc, cols))
             return self
     
    +    @since(2.3)
    +    def bucketBy(self, numBuckets, *cols):
    +        """Buckets the output by the given columns on the file system.
    +
    +        :param numBuckets: the number of buckets to save
    +        :param cols: name of columns
    +
    +        .. note:: Applicable for file-based data sources in combination with
    +                  :py:meth:`DataFrameWriter.saveAsTable`.
    +
    +        >>> (df.write.format('parquet')
    +        ...     .bucketBy(100, 'year', 'month')
    +        ...     .mode("overwrite")
    +        ...     .saveAsTable('bucketed_table'))
    +        """
    +        if len(cols) == 1 and isinstance(cols[0], (list, tuple)):
    --- End diff --
    
    If `len(cols) == 0`, users could hit strange errors.
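
    A minimal sketch of a guard for that case (names match the quoted diff):
    
    ```python
    if not cols:
        raise ValueError("cols must not be empty")
    
    if len(cols) == 1 and isinstance(cols[0], (list, tuple)):
        cols = cols[0]
    ```
    
    Later diffs in this thread instead change the signature to `bucketBy(self, numBuckets, col, *cols)`, which rules out the empty case structurally.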



[GitHub] spark issue #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for bucket...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17077
  
    Merged build finished. Test FAILed.



[GitHub] spark pull request #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for...

Posted by zero323 <gi...@git.apache.org>.
Github user zero323 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17077#discussion_r115151299
  
    --- Diff: python/pyspark/sql/readwriter.py ---
    @@ -563,6 +563,63 @@ def partitionBy(self, *cols):
             self._jwrite = self._jwrite.partitionBy(_to_seq(self._spark._sc, cols))
             return self
     
    +    @since(2.3)
    +    def bucketBy(self, numBuckets, col, *cols):
    +        """Buckets the output by the given columns.If specified,
    +        the output is laid out on the file system similar to Hive's bucketing scheme.
    +
    +        :param numBuckets: the number of buckets to save
    +        :param col: a name of a column, or a list of names.
    +        :param cols: additional names (optional). If `col` is a list it should be empty.
    +
    +        .. note:: Applicable for file-based data sources in combination with
    +                  :py:meth:`DataFrameWriter.saveAsTable`.
    --- End diff --
    
    @gatorsmile Can we?
    
    ```
    ➜  spark git:(master) git rev-parse HEAD   
    2cf83c47838115f71419ba5b9296c69ec1d746cd
    ➜  spark git:(master) bin/spark-shell 
    Spark context Web UI available at http://192.168.1.101:4041
    Spark context available as 'sc' (master = local[*], app id = local-1494184109262).
    Spark session available as 'spark'.
    Welcome to
          ____              __
         / __/__  ___ _____/ /__
        _\ \/ _ \/ _ `/ __/  '_/
       /___/ .__/\_,_/_/ /_/\_\   version 2.3.0-SNAPSHOT
          /_/
             
    Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_121)
    Type in expressions to have them evaluated.
    Type :help for more information.
    
    scala> Seq(("a", 1, 3)).toDF("x", "y", "z").write.bucketBy(3, "x", "y").format("parquet").save("/tmp/foo")
    org.apache.spark.sql.AnalysisException: 'save' does not support bucketing right now;
      at org.apache.spark.sql.DataFrameWriter.assertNotBucketed(DataFrameWriter.scala:305)
      at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:231)
      at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:217)
      ... 48 elided
    ```
    
[GitHub] spark pull request #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17077#discussion_r110632132
  
    --- Diff: python/pyspark/sql/tests.py ---
    @@ -2167,6 +2167,61 @@ def test_BinaryType_serialization(self):
             df = self.spark.createDataFrame(data, schema=schema)
             df.collect()
     
    +    def test_bucketed_write(self):
    +        data = [
    +            (1, "foo", 3.0), (2, "foo", 5.0),
    +            (3, "bar", -1.0), (4, "bar", 6.0),
    +        ]
    +        df = self.spark.createDataFrame(data, ["x", "y", "z"])
    +
    +        # Test write with one bucketing column
    +        df.write.bucketBy(3, "x").mode("overwrite").saveAsTable("pyspark_bucket")
    +        self.assertEqual(
    +            len([c for c in self.spark.catalog.listColumns("pyspark_bucket")
    +                 if c.name == "x" and c.isBucket]),
    +            1
    +        )
    +        self.assertSetEqual(set(data), set(self.spark.table("pyspark_bucket").collect()))
    +
    +        # Test write two bucketing columns
    +        df.write.bucketBy(3, "x", "y").mode("overwrite").saveAsTable("pyspark_bucket")
    +        self.assertEqual(
    +            len([c for c in self.spark.catalog.listColumns("pyspark_bucket")
    +                 if c.name in ("x", "y") and c.isBucket]),
    --- End diff --
    
    @zero323, I am sorry. What do you think about something like the one below?
    
    ```python
    cols = self.spark.catalog.listColumns("pyspark_bucket")
    num = len([c for c in cols if c.name in ("x", "y") and c.isBucket])
    self.assertEqual(num, 2)
    ```


[GitHub] spark pull request #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17077#discussion_r115129884
  
    --- Diff: python/pyspark/sql/readwriter.py ---
    @@ -563,6 +563,60 @@ def partitionBy(self, *cols):
             self._jwrite = self._jwrite.partitionBy(_to_seq(self._spark._sc, cols))
             return self
     
    +    @since(2.3)
    +    def bucketBy(self, numBuckets, *cols):
    +        """Buckets the output by the given columns on the file system.
    +
    +        :param numBuckets: the number of buckets to save
    +        :param cols: name of columns
    +
    +        .. note:: Applicable for file-based data sources in combination with
    +                  :py:meth:`DataFrameWriter.saveAsTable`.
    +
    +        >>> (df.write.format('parquet')
    +        ...     .bucketBy(100, 'year', 'month')
    +        ...     .mode("overwrite")
    +        ...     .saveAsTable('bucketed_table'))
    +        """
    +        if len(cols) == 1 and isinstance(cols[0], (list, tuple)):
    +            cols = cols[0]
    +
    +        if not isinstance(numBuckets, int):
    +            raise TypeError("numBuckets should be an int, got {0}.".format(type(numBuckets)))
    +
    +        if not all(isinstance(c, basestring) for c in cols):
    +            raise TypeError("cols argument should be a string or a sequence of strings.")
    +
    +        col = cols[0]
    +        cols = cols[1:]
    +
    +        self._jwrite = self._jwrite.bucketBy(numBuckets, col, _to_seq(self._spark._sc, cols))
    +        return self
    +
    +    @since(2.3)
    +    def sortBy(self, *cols):
    +        """Sorts the output in each bucket by the given columns on the file system.
    +
    +        :param cols: name of columns
    +
    +        >>> (df.write.format('parquet')
    +        ...     .bucketBy(100, 'year', 'month')
    +        ...     .sortBy('day')
    +        ...     .mode("overwrite")
    +        ...     .saveAsTable('sorted_bucketed_table'))
    +        """
    +        if len(cols) == 1 and isinstance(cols[0], (list, tuple)):
    +            cols = cols[0]
    +
    +        if not all(isinstance(c, basestring) for c in cols):
    +            raise TypeError("cols argument should be a string or a sequence of strings.")
    --- End diff --
    
    same note as above.


[GitHub] spark issue #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for bucket...

Posted by zero323 <gi...@git.apache.org>.
Github user zero323 commented on the issue:

    https://github.com/apache/spark/pull/17077
  
    @gatorsmile #17938


[GitHub] spark pull request #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17077#discussion_r115129876
  
    --- Diff: python/pyspark/sql/readwriter.py ---
    @@ -563,6 +563,60 @@ def partitionBy(self, *cols):
             self._jwrite = self._jwrite.partitionBy(_to_seq(self._spark._sc, cols))
             return self
     
    +    @since(2.3)
    +    def bucketBy(self, numBuckets, *cols):
    +        """Buckets the output by the given columns on the file system.
    +
    +        :param numBuckets: the number of buckets to save
    +        :param cols: name of columns
    +
    +        .. note:: Applicable for file-based data sources in combination with
    +                  :py:meth:`DataFrameWriter.saveAsTable`.
    +
    +        >>> (df.write.format('parquet')
    +        ...     .bucketBy(100, 'year', 'month')
    +        ...     .mode("overwrite")
    +        ...     .saveAsTable('bucketed_table'))
    +        """
    +        if len(cols) == 1 and isinstance(cols[0], (list, tuple)):
    +            cols = cols[0]
    +
    +        if not isinstance(numBuckets, int):
    +            raise TypeError("numBuckets should be an int, got {0}.".format(type(numBuckets)))
    +
    +        if not all(isinstance(c, basestring) for c in cols):
    +            raise TypeError("cols argument should be a string or a sequence of strings.")
    --- End diff --
    
    So I don't think we really support all sequences (the above typecheck on L581 requires list or tuple but there are additional types of sequences).
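    
    To make the point concrete, a small standalone sketch of the unpacking check under discussion (the helper name `_unpack` is made up for illustration):
    
    ```python
    def _unpack(*cols):
        # Mirrors the check above: only a single list or tuple is unpacked.
        if len(cols) == 1 and isinstance(cols[0], (list, tuple)):
            cols = cols[0]
        return cols
    
    _unpack("x", "y")          # ('x', 'y')
    _unpack(["x", "y"])        # ['x', 'y'] -- unpacked
    _unpack(iter(["x", "y"]))  # (<list_iterator ...>,) -- other sequence-likes slip through
    ```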


[GitHub] spark issue #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for bucket...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17077
  
    **[Test build #75658 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75658/testReport)** for PR 17077 at commit [`7b93482`](https://github.com/apache/spark/commit/7b93482f31f2efb3d4d742eb3e385e6b4a2bc14e).


[GitHub] spark pull request #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17077#discussion_r111214507
  
    --- Diff: python/pyspark/sql/tests.py ---
    @@ -2167,6 +2167,56 @@ def test_BinaryType_serialization(self):
             df = self.spark.createDataFrame(data, schema=schema)
             df.collect()
     
    +    def test_bucketed_write(self):
    +        data = [
    +            (1, "foo", 3.0), (2, "foo", 5.0),
    +            (3, "bar", -1.0), (4, "bar", 6.0),
    +        ]
    +        df = self.spark.createDataFrame(data, ["x", "y", "z"])
    +
    +        def count_bucketed_cols(names, table="pyspark_bucket"):
    +            """Given a sequence of column names and a table name
    +            query the catalog and return number o columns which are
    +            used for bucketing
    +            """
    +            cols = self.spark.catalog.listColumns(table)
    +            num = len([c for c in cols if c.name in names and c.isBucket])
    +            return num
    +
    +        # Test write with one bucketing column
    +        df.write.bucketBy(3, "x").mode("overwrite").saveAsTable("pyspark_bucket")
    +        self.assertEqual(count_bucketed_cols(["x"]), 1)
    +        self.assertSetEqual(set(data), set(self.spark.table("pyspark_bucket").collect()))
    +
    +        # Test write two bucketing columns
    +        df.write.bucketBy(3, "x", "y").mode("overwrite").saveAsTable("pyspark_bucket")
    +        self.assertEqual(count_bucketed_cols(["x", "y"]), 2)
    +        self.assertSetEqual(set(data), set(self.spark.table("pyspark_bucket").collect()))
    +
    +        # Test write with bucket and sort
    +        df.write.bucketBy(2, "x").sortBy("z").mode("overwrite").saveAsTable("pyspark_bucket")
    +        self.assertEqual(count_bucketed_cols(["x"]), 1)
    +        self.assertSetEqual(set(data), set(self.spark.table("pyspark_bucket").collect()))
    +
    +        # Test write with a list of columns
    +        df.write.bucketBy(3, ["x", "y"]).mode("overwrite").saveAsTable("pyspark_bucket")
    +        self.assertEqual(count_bucketed_cols(["x", "y"]), 2)
    +        self.assertSetEqual(set(data), set(self.spark.table("pyspark_bucket").collect()))
    +
    +        # Test write with bucket and sort with a list of columns
    +        (df.write.bucketBy(2, "x")
    +            .sortBy(["y", "z"])
    +            .mode("overwrite").saveAsTable("pyspark_bucket"))
    +        self.assertSetEqual(set(data), set(self.spark.table("pyspark_bucket").collect()))
    +
    +        # Test write with bucket and sort with multiple columns
    +        (df.write.bucketBy(2, "x")
    +            .sortBy("y", "z")
    +            .mode("overwrite").saveAsTable("pyspark_bucket"))
    +        self.assertSetEqual(set(data), set(self.spark.table("pyspark_bucket").collect()))
    +
    +        self.spark.sql("DROP TABLE IF EXISTS pyspark_bucket")
    --- End diff --
    
    If we're going to drop the table here, we should probably put it in a `finally` block.
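    
    For instance, a sketch of the shape being suggested here, assuming the same `data`/`df` setup as in the diff above:
    
    ```python
    def test_bucketed_write(self):
        data = [
            (1, "foo", 3.0), (2, "foo", 5.0),
            (3, "bar", -1.0), (4, "bar", 6.0),
        ]
        df = self.spark.createDataFrame(data, ["x", "y", "z"])
        try:
            df.write.bucketBy(3, "x").mode("overwrite").saveAsTable("pyspark_bucket")
            # ... assertions as above ...
        finally:
            # Runs even when an assertion fails, so the table never
            # leaks into subsequent tests.
            self.spark.sql("DROP TABLE IF EXISTS pyspark_bucket")
    ```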


[GitHub] spark pull request #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/17077


[GitHub] spark pull request #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17077#discussion_r110626138
  
    --- Diff: python/pyspark/sql/tests.py ---
    @@ -2038,6 +2038,61 @@ def test_BinaryType_serialization(self):
             df = self.spark.createDataFrame(data, schema=schema)
             df.collect()
     
    +    def test_bucketed_write(self):
    +        data = [
    +            (1, "foo", 3.0), (2, "foo", 5.0),
    +            (3, "bar", -1.0), (4, "bar", 6.0),
    +        ]
    +        df = self.spark.createDataFrame(data, ["x", "y", "z"])
    +
    +        # Test write with one bucketing column
    +        df.write.bucketBy(3, "x").mode("overwrite").saveAsTable("pyspark_bucket")
    +        self.assertEqual(
    +            len([c for c in self.spark.catalog.listColumns("pyspark_bucket")
    +                 if c.name == "x" and c.isBucket]),
    --- End diff --
    
    Thanks for taking a look at the related ones and trying it out.


[GitHub] spark issue #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for bucket...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17077
  
    Merged build finished. Test PASSed.


[GitHub] spark issue #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for bucket...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17077
  
    **[Test build #76193 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76193/testReport)** for PR 17077 at commit [`5da0e0d`](https://github.com/apache/spark/commit/5da0e0dceb0125f112c068d5cc34a25a16ab30ac).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


[GitHub] spark issue #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for bucket...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17077
  
    **[Test build #73535 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73535/testReport)** for PR 17077 at commit [`ae93166`](https://github.com/apache/spark/commit/ae93166db34d4b3ee784177972e88eea34d4936e).


[GitHub] spark issue #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for bucket...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17077
  
    Merged build finished. Test PASSed.


[GitHub] spark issue #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for bucket...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17077
  
    **[Test build #73527 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73527/testReport)** for PR 17077 at commit [`18c709c`](https://github.com/apache/spark/commit/18c709c4bf77fc6db5530e00a9e5bba0e1ab0250).
     * This patch **fails MiMa tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


[GitHub] spark pull request #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for...

Posted by zero323 <gi...@git.apache.org>.
Github user zero323 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17077#discussion_r110693499
  
    --- Diff: python/pyspark/sql/tests.py ---
    @@ -2167,6 +2167,61 @@ def test_BinaryType_serialization(self):
             df = self.spark.createDataFrame(data, schema=schema)
             df.collect()
     
    +    def test_bucketed_write(self):
    +        data = [
    +            (1, "foo", 3.0), (2, "foo", 5.0),
    +            (3, "bar", -1.0), (4, "bar", 6.0),
    +        ]
    +        df = self.spark.createDataFrame(data, ["x", "y", "z"])
    +
    +        # Test write with one bucketing column
    +        df.write.bucketBy(3, "x").mode("overwrite").saveAsTable("pyspark_bucket")
    +        self.assertEqual(
    +            len([c for c in self.spark.catalog.listColumns("pyspark_bucket")
    +                 if c.name == "x" and c.isBucket]),
    +            1
    +        )
    +        self.assertSetEqual(set(data), set(self.spark.table("pyspark_bucket").collect()))
    +
    +        # Test write two bucketing columns
    +        df.write.bucketBy(3, "x", "y").mode("overwrite").saveAsTable("pyspark_bucket")
    +        self.assertEqual(
    +            len([c for c in self.spark.catalog.listColumns("pyspark_bucket")
    +                 if c.name in ("x", "y") and c.isBucket]),
    --- End diff --
    
    If you think it is better I'll trust your judgment.


[GitHub] spark issue #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for bucket...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17077
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/76193/
    Test PASSed.


[GitHub] spark issue #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for bucket...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/17077
  
    I think we should, because branch-2.2 has been cut.


[GitHub] spark pull request #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for...

Posted by zero323 <gi...@git.apache.org>.
Github user zero323 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17077#discussion_r110621829
  
    --- Diff: python/pyspark/sql/tests.py ---
    @@ -2038,6 +2038,61 @@ def test_BinaryType_serialization(self):
             df = self.spark.createDataFrame(data, schema=schema)
             df.collect()
     
    +    def test_bucketed_write(self):
    +        data = [
    +            (1, "foo", 3.0), (2, "foo", 5.0),
    +            (3, "bar", -1.0), (4, "bar", 6.0),
    +        ]
    +        df = self.spark.createDataFrame(data, ["x", "y", "z"])
    +
    +        # Test write with one bucketing column
    +        df.write.bucketBy(3, "x").mode("overwrite").saveAsTable("pyspark_bucket")
    +        self.assertEqual(
    +            len([c for c in self.spark.catalog.listColumns("pyspark_bucket")
    +                 if c.name == "x" and c.isBucket]),
    --- End diff --
    
    We can simplify this to:
      
        catalog = self.spark.catalog
    
        sum(c.name == "x" and c.isBucket for c in catalog.listColumns("pyspark_bucket"))
    
    if you think this is more readable, but I am not convinced that it makes sense to use a separate variable here. We have a few tests like this, we don't care about the sequence itself, and I think it would only pollute the scope. But if you have strong feelings about it, I am happy to adjust it.
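    
    (As an aside, the `sum` variant works because `bool` is a subclass of `int` in Python, so summing a generator of booleans counts the `True` values:)
    
    ```python
    sum(c == "x" for c in ["x", "y", "x"])  # 2, since True == 1
    ```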
    
    Regarding the comment style... Right now (excluding `bucketBy` and `sortBy`) we have 
    
    - 23 docstrings with:
    
            """....
            """
    
    - 7 docstrings with:
    
            """
            ....
            """
    
    in `readwriter`. As you said both are valid, but if we want to keep only one convention it would be a good idea to adjust the whole module.


[GitHub] spark issue #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for bucket...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on the issue:

    https://github.com/apache/spark/pull/17077
  
    Thanks for helping with the review @HyukjinKwon :)


[GitHub] spark issue #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for bucket...

Posted by zero323 <gi...@git.apache.org>.
Github user zero323 commented on the issue:

    https://github.com/apache/spark/pull/17077
  
    🙁


[GitHub] spark issue #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for bucket...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/17077
  
    @zero323 Could you also update the [SQL document](http://spark.apache.org/docs/latest/sql-programming-guide.html)?
    
    https://github.com/apache/spark/blob/master/docs/sql-programming-guide.md
    
    Thank you!


[GitHub] spark issue #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for bucket...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17077
  
    **[Test build #73522 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73522/testReport)** for PR 17077 at commit [`0ef84fb`](https://github.com/apache/spark/commit/0ef84fbb15e7cdfe0b2d8353ca315dce9b2fabfb).


[GitHub] spark pull request #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for...

Posted by zero323 <gi...@git.apache.org>.
Github user zero323 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17077#discussion_r115133626
  
    --- Diff: python/pyspark/sql/readwriter.py ---
    @@ -563,6 +563,60 @@ def partitionBy(self, *cols):
             self._jwrite = self._jwrite.partitionBy(_to_seq(self._spark._sc, cols))
             return self
     
    +    @since(2.3)
    +    def bucketBy(self, numBuckets, *cols):
    +        """Buckets the output by the given columns on the file system.
    +
    +        :param numBuckets: the number of buckets to save
    +        :param cols: name of columns
    +
    +        .. note:: Applicable for file-based data sources in combination with
    +                  :py:meth:`DataFrameWriter.saveAsTable`.
    +
    +        >>> (df.write.format('parquet')
    +        ...     .bucketBy(100, 'year', 'month')
    +        ...     .mode("overwrite")
    +        ...     .saveAsTable('bucketed_table'))
    +        """
    +        if len(cols) == 1 and isinstance(cols[0], (list, tuple)):
    +            cols = cols[0]
    +
    +        if not isinstance(numBuckets, int):
    +            raise TypeError("numBuckets should be an int, got {0}.".format(type(numBuckets)))
    +
    +        if not all(isinstance(c, basestring) for c in cols):
    +            raise TypeError("cols argument should be a string or a sequence of strings.")
    --- End diff --
    
    Good point. We can support arbitrary `Iterable[str]` though. 
    
    ```python
    if len(cols) == 1 and isinstance(cols[0], collections.abc.Iterable):
        cols = list(cols[0])
    ```
    
    Caveat is, we don't allow this anywhere else.
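    
    One more caveat worth noting (an assumed illustration, not part of the patch): a plain string is itself iterable, so a bare column name would pass the `Iterable` check and be split into characters unless strings are excluded explicitly:
    
    ```python
    from collections.abc import Iterable  # Python 2 would use collections.Iterable
    
    isinstance("xy", Iterable)  # True -- a bare string passes the check
    list("xy")                  # ['x', 'y'] -- and would be split into characters
    
    def _unpack(*cols):
        # Assumed safer variant: unpack any iterable except strings.
        if (len(cols) == 1 and isinstance(cols[0], Iterable)
                and not isinstance(cols[0], str)):
            cols = list(cols[0])
        return cols
    ```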


[GitHub] spark issue #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for bucket...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17077
  
    **[Test build #73504 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73504/testReport)** for PR 17077 at commit [`4477493`](https://github.com/apache/spark/commit/4477493587cc8ba9b9b3696601f334827e60d2bc).


[GitHub] spark issue #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for bucket...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17077
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75659/
    Test FAILed.


[GitHub] spark issue #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for bucket...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17077
  
    **[Test build #75659 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75659/testReport)** for PR 17077 at commit [`845ee87`](https://github.com/apache/spark/commit/845ee8783c54123c743e176def11af7455192d42).


[GitHub] spark pull request #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17077#discussion_r115134650
  
    --- Diff: python/pyspark/sql/readwriter.py ---
    @@ -563,6 +563,60 @@ def partitionBy(self, *cols):
             self._jwrite = self._jwrite.partitionBy(_to_seq(self._spark._sc, cols))
             return self
     
    +    @since(2.3)
    +    def bucketBy(self, numBuckets, *cols):
    +        """Buckets the output by the given columns on the file system.
    --- End diff --
    
    Thank you for adding the wrapper. 
    
    Yes. We should make the Python APIs consistent with Scala APIs, if possible. 


[GitHub] spark issue #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for bucket...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17077
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75666/
    Test FAILed.


[GitHub] spark pull request #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for...

Posted by zero323 <gi...@git.apache.org>.
Github user zero323 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17077#discussion_r110692936
  
    --- Diff: python/pyspark/sql/tests.py ---
    @@ -2167,6 +2167,61 @@ def test_BinaryType_serialization(self):
             df = self.spark.createDataFrame(data, schema=schema)
             df.collect()
     
    +    def test_bucketed_write(self):
    +        data = [
    +            (1, "foo", 3.0), (2, "foo", 5.0),
    +            (3, "bar", -1.0), (4, "bar", 6.0),
    +        ]
    +        df = self.spark.createDataFrame(data, ["x", "y", "z"])
    +
    +        # Test write with one bucketing column
    +        df.write.bucketBy(3, "x").mode("overwrite").saveAsTable("pyspark_bucket")
    +        self.assertEqual(
    +            len([c for c in self.spark.catalog.listColumns("pyspark_bucket")
    +                 if c.name == "x" and c.isBucket]),
    +            1
    +        )
    +        self.assertSetEqual(set(data), set(self.spark.table("pyspark_bucket").collect()))
    +
    +        # Test write two bucketing columns
    +        df.write.bucketBy(3, "x", "y").mode("overwrite").saveAsTable("pyspark_bucket")
    +        self.assertEqual(
    +            len([c for c in self.spark.catalog.listColumns("pyspark_bucket")
    +                 if c.name in ("x", "y") and c.isBucket]),
    +            2
    +        )
    +        self.assertSetEqual(set(data), set(self.spark.table("pyspark_bucket").collect()))
    +
    +        # Test write with bucket and sort
    +        df.write.bucketBy(2, "x").sortBy("z").mode("overwrite").saveAsTable("pyspark_bucket")
    +        self.assertEqual(
    +            len([c for c in self.spark.catalog.listColumns("pyspark_bucket")
    +                 if c.name == "x" and c.isBucket]),
    +            1
    +        )
    +        self.assertSetEqual(set(data), set(self.spark.table("pyspark_bucket").collect()))
    +
    +        # Test write with a list of columns
    +        df.write.bucketBy(3, ["x", "y"]).mode("overwrite").saveAsTable("pyspark_bucket")
    +        self.assertEqual(
    +            len([c for c in self.spark.catalog.listColumns("pyspark_bucket")
    +                 if c.name in ("x", "y") and c.isBucket]),
    +            2
    +        )
    +        self.assertSetEqual(set(data), set(self.spark.table("pyspark_bucket").collect()))
    +
    +        # Test write with bucket and sort with a list of columns
    +        (df.write.bucketBy(2, "x")
    +            .sortBy(["y", "z"])
    +            .mode("overwrite").saveAsTable("pyspark_bucket"))
    +        self.assertSetEqual(set(data), set(self.spark.table("pyspark_bucket").collect()))
    +
    +        # Test write with bucket and sort with multiple columns
    +        (df.write.bucketBy(2, "x")
    +            .sortBy("y", "z")
    +            .mode("overwrite").saveAsTable("pyspark_bucket"))
    --- End diff --
    
    I don't think that dropping the table before is necessary. We overwrite on each write and name clashes are unlikely.
    
    We can drop it after the tests, but I am not sure how to do it right. `SQLTests` is overgrown and I am not sure if we should add `tearDown` only for this, but adding `DROP TABLE` in the test itself doesn't look right.
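    
    For reference, a sketch of what the `tearDown` variant could look like -- whether it belongs on the shared `SQLTests` class is exactly the open question, and the base class here is assumed from the existing test file:
    
    ```python
    class SQLTests(ReusedPySparkTestCase):
        def tearDown(self):
            super(SQLTests, self).tearDown()
            # Best-effort cleanup that runs after every test in the class,
            # not just the bucketing one.
            self.spark.sql("DROP TABLE IF EXISTS pyspark_bucket")
    ```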


[GitHub] spark issue #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for bucket...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17077
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75658/
    Test FAILed.


[GitHub] spark issue #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for bucket...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17077
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73527/
    Test FAILed.


[GitHub] spark issue #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for bucket...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17077
  
    **[Test build #75659 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75659/testReport)** for PR 17077 at commit [`845ee87`](https://github.com/apache/spark/commit/845ee8783c54123c743e176def11af7455192d42).
     * This patch **fails Python style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


[GitHub] spark issue #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for bucket...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/17077
  
    LGTM too.


[GitHub] spark pull request #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17077#discussion_r110624985
  
    --- Diff: python/pyspark/sql/tests.py ---
    @@ -2038,6 +2038,61 @@ def test_BinaryType_serialization(self):
             df = self.spark.createDataFrame(data, schema=schema)
             df.collect()
     
    +    def test_bucketed_write(self):
    +        data = [
    +            (1, "foo", 3.0), (2, "foo", 5.0),
    +            (3, "bar", -1.0), (4, "bar", 6.0),
    +        ]
    +        df = self.spark.createDataFrame(data, ["x", "y", "z"])
    +
    +        # Test write with one bucketing column
    +        df.write.bucketBy(3, "x").mode("overwrite").saveAsTable("pyspark_bucket")
    +        self.assertEqual(
    +            len([c for c in self.spark.catalog.listColumns("pyspark_bucket")
    +                 if c.name == "x" and c.isBucket]),
    --- End diff --
    
    Yup, it's just my personal preference and I won't argue for it. I am fine with it. I don't feel strongly about either.


[GitHub] spark pull request #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17077#discussion_r110792594
  
    --- Diff: python/pyspark/sql/tests.py ---
    @@ -2167,6 +2167,56 @@ def test_BinaryType_serialization(self):
             df = self.spark.createDataFrame(data, schema=schema)
             df.collect()
     
    +    def test_bucketed_write(self):
    +        data = [
    +            (1, "foo", 3.0), (2, "foo", 5.0),
    +            (3, "bar", -1.0), (4, "bar", 6.0),
    +        ]
    +        df = self.spark.createDataFrame(data, ["x", "y", "z"])
    +
    +        def count_bucketed_cols(names, table="pyspark_bucket"):
    +            """Given a sequence of column names and a table name
    +            query the catalog and return number o columns which are
    +            used for bucketing
    +            """
    +            cols = self.spark.catalog.listColumns(table)
    +            num = len([c for c in cols if c.name in names and c.isBucket])
    +            return num
    +
    +        # Test write with one bucketing column
    +        df.write.bucketBy(3, "x").mode("overwrite").saveAsTable("pyspark_bucket")
    +        self.assertEqual(count_bucketed_cols(["x"]), 1)
    +        self.assertSetEqual(set(data), set(self.spark.table("pyspark_bucket").collect()))
    +
    +        # Test write two bucketing columns
    +        df.write.bucketBy(3, "x", "y").mode("overwrite").saveAsTable("pyspark_bucket")
    +        self.assertEqual(count_bucketed_cols(["x", "y"]), 2)
    +        self.assertSetEqual(set(data), set(self.spark.table("pyspark_bucket").collect()))
    +
    +        # Test write with bucket and sort
    +        df.write.bucketBy(2, "x").sortBy("z").mode("overwrite").saveAsTable("pyspark_bucket")
    +        self.assertEqual(count_bucketed_cols(["x"]), 1)
    +        self.assertSetEqual(set(data), set(self.spark.table("pyspark_bucket").collect()))
    +
    +        # Test write with a list of columns
    +        df.write.bucketBy(3, ["x", "y"]).mode("overwrite").saveAsTable("pyspark_bucket")
    +        self.assertEqual(count_bucketed_cols(["x", "y"]), 2)
    +        self.assertSetEqual(set(data), set(self.spark.table("pyspark_bucket").collect()))
    +
    +        # Test write with bucket and sort with a list of columns
    +        (df.write.bucketBy(2, "x")
    +            .sortBy(["y", "z"])
    +            .mode("overwrite").saveAsTable("pyspark_bucket"))
    +        self.assertSetEqual(set(data), set(self.spark.table("pyspark_bucket").collect()))
    +
    +        # Test write with bucket and sort with multiple columns
    +        (df.write.bucketBy(2, "x")
    +            .sortBy("y", "z")
    +            .mode("overwrite").saveAsTable("pyspark_bucket"))
    +        self.assertSetEqual(set(data), set(self.spark.table("pyspark_bucket").collect()))
    +
    +        self.spark.sql("DROP TABLE IF EXISTS pyspark_bucket")
    --- End diff --
    
    Yea, I think this is a correct way to drop the table.


[GitHub] spark issue #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for bucket...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/17077
  
    Also cc @cloud-fan, who is the author of the original PR that implemented bucketBy.


[GitHub] spark pull request #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for...

Posted by zero323 <gi...@git.apache.org>.
Github user zero323 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17077#discussion_r115138021
  
    --- Diff: python/pyspark/sql/readwriter.py ---
    @@ -563,6 +563,60 @@ def partitionBy(self, *cols):
             self._jwrite = self._jwrite.partitionBy(_to_seq(self._spark._sc, cols))
             return self
     
    +    @since(2.3)
    +    def bucketBy(self, numBuckets, *cols):
    +        """Buckets the output by the given columns on the file system.
    +
    +        :param numBuckets: the number of buckets to save
    +        :param cols: name of columns
    +
    +        .. note:: Applicable for file-based data sources in combination with
    +                  :py:meth:`DataFrameWriter.saveAsTable`.
    +
    +        >>> (df.write.format('parquet')
    +        ...     .bucketBy(100, 'year', 'month')
    +        ...     .mode("overwrite")
    +        ...     .saveAsTable('bucketed_table'))
    +        """
    +        if len(cols) == 1 and isinstance(cols[0], (list, tuple)):
    --- End diff --
    
    Why do you say that? `cols` is variadic, so it should always be `Sized`.
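    
    Quick illustration: a variadic parameter always arrives as a tuple, so `len(cols)` is always well-defined:
    
    ```python
    def f(*cols):
        return cols
    
    f()            # () -- an empty tuple, so len(cols) == 0 is a valid check
    f("x")         # ('x',)
    f(["x", "y"])  # (['x', 'y'],) -- a single list arrives wrapped in a tuple
    ```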


[GitHub] spark issue #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for bucket...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17077
  
    Merged build finished. Test PASSed.


[GitHub] spark issue #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for bucket...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/17077
  
    (I think we need @holdenk's sign-off and further review.)


[GitHub] spark pull request #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for...

Posted by zero323 <gi...@git.apache.org>.
Github user zero323 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17077#discussion_r110699779
  
    --- Diff: python/pyspark/sql/tests.py ---
    @@ -2167,6 +2167,61 @@ def test_BinaryType_serialization(self):
             df = self.spark.createDataFrame(data, schema=schema)
             df.collect()
     
    +    def test_bucketed_write(self):
    +        data = [
    +            (1, "foo", 3.0), (2, "foo", 5.0),
    +            (3, "bar", -1.0), (4, "bar", 6.0),
    +        ]
    +        df = self.spark.createDataFrame(data, ["x", "y", "z"])
    +
    +        # Test write with one bucketing column
    +        df.write.bucketBy(3, "x").mode("overwrite").saveAsTable("pyspark_bucket")
    +        self.assertEqual(
    +            len([c for c in self.spark.catalog.listColumns("pyspark_bucket")
    +                 if c.name == "x" and c.isBucket]),
    +            1
    +        )
    +        self.assertSetEqual(set(data), set(self.spark.table("pyspark_bucket").collect()))
    +
    +        # Test write two bucketing columns
    +        df.write.bucketBy(3, "x", "y").mode("overwrite").saveAsTable("pyspark_bucket")
    +        self.assertEqual(
    +            len([c for c in self.spark.catalog.listColumns("pyspark_bucket")
    +                 if c.name in ("x", "y") and c.isBucket]),
    --- End diff --
    
    Copying the docs directly from the Scala docs could be confusing, since we won't support this in 2.0 and 2.1, and the changes since 2.0 don't really affect us here.
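
    For context, a small self-contained sketch of the verification idea in
    the test above (it assumes an active `SparkSession` bound to `spark`;
    the table name is the one used in the test): after a bucketed
    `saveAsTable`, `spark.catalog.listColumns` exposes the bucketing
    columns through the `isBucket` flag.

    ```python
    df = spark.createDataFrame(
        [(1, "foo", 3.0), (2, "foo", 5.0)], ["x", "y", "z"])

    # Write a bucketed table, then ask the catalog which columns are buckets.
    df.write.bucketBy(3, "x").mode("overwrite").saveAsTable("pyspark_bucket")
    bucket_cols = [c.name
                   for c in spark.catalog.listColumns("pyspark_bucket")
                   if c.isBucket]
    print(bucket_cols)  # expected: ['x']
    ```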


[GitHub] spark issue #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for bucket...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/17077
  
    Thanks for cc'ing me. Let me cc @davies, as he was reviewing it and it seems close to being merged, and also @holdenk.


[GitHub] spark pull request #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17077#discussion_r115153524
  
    --- Diff: python/pyspark/sql/readwriter.py ---
    @@ -563,6 +563,63 @@ def partitionBy(self, *cols):
             self._jwrite = self._jwrite.partitionBy(_to_seq(self._spark._sc, cols))
             return self
     
    +    @since(2.3)
    +    def bucketBy(self, numBuckets, col, *cols):
    +        """Buckets the output by the given columns.If specified,
    --- End diff --
    
    Nit: `columns.If` -> `columns. If`


[GitHub] spark issue #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for bucket...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17077
  
    Merged build finished. Test FAILed.


[GitHub] spark issue #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for bucket...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17077
  
    Merged build finished. Test FAILed.


[GitHub] spark issue #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for bucket...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17077
  
    **[Test build #73504 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73504/testReport)** for PR 17077 at commit [`4477493`](https://github.com/apache/spark/commit/4477493587cc8ba9b9b3696601f334827e60d2bc).
     * This patch **fails PySpark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


[GitHub] spark issue #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for bucket...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17077
  
    **[Test build #76193 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76193/testReport)** for PR 17077 at commit [`5da0e0d`](https://github.com/apache/spark/commit/5da0e0dceb0125f112c068d5cc34a25a16ab30ac).


[GitHub] spark issue #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for bucket...

Posted by zero323 <gi...@git.apache.org>.
Github user zero323 commented on the issue:

    https://github.com/apache/spark/pull/17077
  
    @gatorsmile 
    
    >  Could you also update the SQL document?
    
    Sure, but I'll need some guidance here. Somewhere in the [Generic Load/Save Functions](https://spark.apache.org/docs/latest/sql-programming-guide.html#generic-loadsave-functions) section, right? But I guess we'll need a separate section for that, and we should probably document `partitionBy` as well.
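
    As a starting point, a sketch of the kind of snippet such a section
    could carry (table and column names are made up for illustration;
    `sortBy` is only valid together with `bucketBy`, and bucketed writes
    go through `saveAsTable`):

    ```python
    (df.write
        .partitionBy("year")     # one output directory per distinct value
        .bucketBy(4, "user_id")  # fixed number of buckets per partition
        .sortBy("user_id")       # sort rows within each bucket
        .saveAsTable("partitioned_bucketed_table"))
    ```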


[GitHub] spark issue #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for bucket...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/17077
  
    Yes. You can create a new section to explain how to create bucketed tables.


[GitHub] spark pull request #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17077#discussion_r110630404
  
    --- Diff: python/pyspark/sql/readwriter.py ---
    @@ -545,6 +545,57 @@ def partitionBy(self, *cols):
             self._jwrite = self._jwrite.partitionBy(_to_seq(self._spark._sc, cols))
             return self
     
    +    @since(2.2)
    +    def bucketBy(self, numBuckets, *cols):
    +        """Buckets the output by the given columns on the file system.
    --- End diff --
    
    I think just copying it from the Scala doc is good enough, and it avoids the overhead of sweeping the documentation when we start to support other operations later.


[GitHub] spark issue #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for bucket...

Posted by zero323 <gi...@git.apache.org>.
Github user zero323 commented on the issue:

    https://github.com/apache/spark/pull/17077
  
    @holdenk, @HyukjinKwon Do we retarget this to 2.3?


[GitHub] spark issue #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for bucket...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17077
  
    **[Test build #73535 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73535/testReport)** for PR 17077 at commit [`ae93166`](https://github.com/apache/spark/commit/ae93166db34d4b3ee784177972e88eea34d4936e).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


[GitHub] spark pull request #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for...

Posted by zero323 <gi...@git.apache.org>.
Github user zero323 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17077#discussion_r115138060
  
    --- Diff: python/pyspark/sql/readwriter.py ---
    @@ -563,6 +563,60 @@ def partitionBy(self, *cols):
             self._jwrite = self._jwrite.partitionBy(_to_seq(self._spark._sc, cols))
             return self
     
    +    @since(2.3)
    +    def bucketBy(self, numBuckets, *cols):
    +        """Buckets the output by the given columns on the file system.
    +
    +        :param numBuckets: the number of buckets to save
    +        :param cols: name of columns
    +
    +        .. note:: Applicable for file-based data sources in combination with
    +                  :py:meth:`DataFrameWriter.saveAsTable`.
    +
    +        >>> (df.write.format('parquet')
    +        ...     .bucketBy(100, 'year', 'month')
    +        ...     .mode("overwrite")
    +        ...     .saveAsTable('bucketed_table'))
    +        """
    +        if len(cols) == 1 and isinstance(cols[0], (list, tuple)):
    +            cols = cols[0]
    +
    +        if not isinstance(numBuckets, int):
    +            raise TypeError("numBuckets should be an int, got {0}.".format(type(numBuckets)))
    +
    +        if not all(isinstance(c, basestring) for c in cols):
    +            raise TypeError("cols argument should be a string or a sequence of strings.")
    --- End diff --
    
    Or we just replace error message with:
    
    ```
    "cols argument should be a string, List[str] or Tuple[str, ...]"
    ```
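
    To make the intended behavior concrete, a hedged sketch of the call
    patterns these checks accept and reject (it assumes `df` is a
    `DataFrame`; the error texts are the ones proposed above):

    ```python
    df.write.bucketBy(3, "x", "y")    # accepted: variadic column names
    df.write.bucketBy(3, ["x", "y"])  # accepted: a single list of names

    try:
        df.write.bucketBy("3", "x")   # rejected: numBuckets is not an int
    except TypeError as e:
        print(e)  # e.g. "numBuckets should be an int, got <class 'str'>."
    ```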


[GitHub] spark issue #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for bucket...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17077
  
    Merged build finished. Test PASSed.


[GitHub] spark issue #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for bucket...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17077
  
    **[Test build #75667 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75667/testReport)** for PR 17077 at commit [`71c9e0f`](https://github.com/apache/spark/commit/71c9e0faf39b979eb7f61d74af8c1821d0a0bcf3).

