You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by drboyer <gi...@git.apache.org> on 2018/02/22 05:28:49 UTC

[GitHub] spark pull request #20658: [SPARK-23488][python] Add missing catalog methods...

GitHub user drboyer opened a pull request:

    https://github.com/apache/spark/pull/20658

    [SPARK-23488][python] Add missing catalog methods to python API

    ## What changes were proposed in this pull request?
    
    As noted in SPARK-23488, the Python Catalog API was missing some methods that are present in the Scala API. Both for the sake of consistency, and because of their utility, I've added the missing methods (the database/table/functionExists() and getDatabase/Table/Function() methods).
    
    I modeled these methods off of how the existing ones were written in catalog.py.
    
    ## How was this patch tested?
    
    manually tested the added methods and compared functionality to the scala API counterparts
    
    ## Questions
    
    I wasn't sure whether to set the `@since(x.y)` decorators on the new functions, and if so, what value to set them to?


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/drboyer/spark add-other-catalog-methods

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20658.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #20658
    
----
commit c82394a87d1a36b3fc899553433119d5ca8bc28f
Author: Devin Boyer <dr...@...>
Date:   2018-02-22T05:06:37Z

    add missing catalog methods to python API

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20658: [SPARK-23488][python] Add missing catalog methods to pyt...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/20658
  
    ok to test


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20658: [SPARK-23488][python] Add missing catalog methods to pyt...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20658
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20658: [SPARK-23488][python] Add missing catalog methods to pyt...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20658
  
    **[Test build #87647 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87647/testReport)** for PR 20658 at commit [`a49ffa0`](https://github.com/apache/spark/commit/a49ffa010a46a0d87de124d8ddf66c8173b756fb).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20658: [SPARK-23488][python] Add missing catalog methods to pyt...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20658
  
    Can one of the admins verify this patch?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20658: [SPARK-23488][python] Add missing catalog methods to pyt...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/20658
  
    cc @ueshin 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20658: [SPARK-23488][python] Add missing catalog methods...

Posted by drboyer <gi...@git.apache.org>.
Github user drboyer commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20658#discussion_r170158422
  
    --- Diff: python/pyspark/sql/catalog.py ---
    @@ -28,7 +28,7 @@
     Database = namedtuple("Database", "name description locationUri")
     Table = namedtuple("Table", "name database description tableType isTemporary")
     Column = namedtuple("Column", "name description dataType nullable isPartition isBucket")
    -Function = namedtuple("Function", "name description className isTemporary")
    +Function = namedtuple("Function", "name database description className isTemporary")
    --- End diff --
    
    Ah yes, `database` is in the [Scala api](http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.catalog.Function) so I added it in for the sake of completeness, but I'm happy to remove it if there's a concern.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20658: [SPARK-23488][python] Add missing catalog methods to pyt...

Posted by drboyer <gi...@git.apache.org>.
Github user drboyer commented on the issue:

    https://github.com/apache/spark/pull/20658
  
    @HyukjinKwon thanks for the review so far! Sorry for the delay, I somehow missed the Python style output in the test logs earlier. How's this look now?
    
    Can you elaborate more on "doctest" if it's still needed? From what I can tell the only documentation for the Catalog is [this simple reference](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=catalog#pyspark.sql.SparkSession.catalog) which would be unaffected by my change


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20658: [SPARK-23488][python] Add missing catalog methods to pyt...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20658
  
    Can one of the admins verify this patch?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20658: [SPARK-23488][python] Add missing catalog methods to pyt...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20658
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87647/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20658: [SPARK-23488][python] Add missing catalog methods to pyt...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20658
  
    **[Test build #87647 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87647/testReport)** for PR 20658 at commit [`a49ffa0`](https://github.com/apache/spark/commit/a49ffa010a46a0d87de124d8ddf66c8173b756fb).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20658: [SPARK-23488][python] Add missing catalog methods to pyt...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20658
  
    Can one of the admins verify this patch?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20658: [SPARK-23488][python] Add missing catalog methods to pyt...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20658
  
    **[Test build #87627 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87627/testReport)** for PR 20658 at commit [`d7e03cd`](https://github.com/apache/spark/commit/d7e03cd507b58fda8830b580390f5224ee7c8d65).
     * This patch **fails Python style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20658: [SPARK-23488][python] Add missing catalog methods to pyt...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/20658
  
    Actually, I think I need some more feedback about matching https://github.com/apache/spark/pull/20658#discussion_r170003442. 
    
    @rxin, @hvanhovell and @cloud-fan (from checking the history), 
    
    the problem here seems Scala side of `Function`:
    
    https://github.com/apache/spark/blob/04d417a7ca8ef694658b26fb697a035717414731/sql/core/src/main/scala/org/apache/spark/sql/catalog/interface.scala#L129-L135
    
    has `database`. Was checking the history. Looks this was added later alone and was missed in Python one:
    
    https://github.com/apache/spark/blob/a49ffa010a46a0d87de124d8ddf66c8173b756fb/python/pyspark/sql/catalog.py#L31
    
    We can't directly add it because it breaks compatibility, for example, if we go as below:
    
    ```
    Function = namedtuple("Function", "name database description className isTemporary") 
    ```
    
    It breaks
    
    ```python
    >>> Function("a","b", "c", "d")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: __new__() takes exactly 6 arguments (5 given)
    ```
    
    We could workaround it by a hack to inject a default value to allow the case above but it is still a problem, for example, if we call like `len` and then length will be changed.
    
    Do you guys think it's okay to just add it as below with a release/migration note?
    
    ```diff
    - Function = namedtuple("Function", "name description className isTemporary") 
    + Function = namedtuple("Function", "name database description className isTemporary") 
    ```
    



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20658: [SPARK-23488][python] Add missing catalog methods to pyt...

Posted by drboyer <gi...@git.apache.org>.
Github user drboyer commented on the issue:

    https://github.com/apache/spark/pull/20658
  
    Thanks @holdenk, I can open a separate JIRA about the missing field in `Function` if it seems worth fixing. It wasn't critical for me, I just happened to notice while doing some testing so I included it in my inital commit.
    
    I hadn't added more complex docstrings just since these seemed like pretty simple methods with straightforward parameters. Happy to add :param: and :return: annotations if desired, but should we add these to some of the other catalog methods as well if we're adding it to these new ones (thinking especially of the `list*()` methods)?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20658: [SPARK-23488][python] Add missing catalog methods to pyt...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20658
  
    **[Test build #87627 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87627/testReport)** for PR 20658 at commit [`d7e03cd`](https://github.com/apache/spark/commit/d7e03cd507b58fda8830b580390f5224ee7c8d65).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20658: [SPARK-23488][python] Add missing catalog methods...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20658#discussion_r170003647
  
    --- Diff: python/pyspark/sql/catalog.py ---
    @@ -137,6 +138,78 @@ def listColumns(self, tableName, dbName=None):
                     isBucket=jcolumn.isBucket()))
             return columns
     
    +    @ignore_unicode_prefix
    +    # TODO: @since() decorator?
    --- End diff --
    
    I think `@since(2.4)`.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20658: [SPARK-23488][python] Add missing catalog methods to pyt...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20658
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20658: [SPARK-23488][python] Add missing catalog methods to pyt...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20658
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87627/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20658: [SPARK-23488][python] Add missing catalog methods to pyt...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on the issue:

    https://github.com/apache/spark/pull/20658
  
    So it looks like adding the database field has been backed out for this change, which is probably the right thing to do, if we want to make that change we should do that in a separate JIRA & commit so it can be rolled back separately from the new methods.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20658: [SPARK-23488][python] Add missing catalog methods...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20658#discussion_r170003442
  
    --- Diff: python/pyspark/sql/catalog.py ---
    @@ -28,7 +28,7 @@
     Database = namedtuple("Database", "name description locationUri")
     Table = namedtuple("Table", "name database description tableType isTemporary")
     Column = namedtuple("Column", "name description dataType nullable isPartition isBucket")
    -Function = namedtuple("Function", "name description className isTemporary")
    +Function = namedtuple("Function", "name database description className isTemporary")
    --- End diff --
    
    Hm, wouldn't this break backward compatibility?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org