Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2021/12/02 04:52:06 UTC

[GitHub] [spark] HyukjinKwon opened a new pull request #34774: [SPARK-37516][PYTHON][SQL] Uses Python's standard string formatter for SQL API in PySpark

HyukjinKwon opened a new pull request #34774:
URL: https://github.com/apache/spark/pull/34774


   ### What changes were proposed in this pull request?
   
   This PR proposes to use [Python's standard string formatter](https://docs.python.org/3/library/string.html#custom-string-formatting) in `SparkSession.sql`; see also https://github.com/apache/spark/pull/34677.
   
   ### Why are the changes needed?
   
   To improve usability in PySpark. With this change, `SparkSession.sql` works together with Python's standard string formatter.
   
   ### Does this PR introduce _any_ user-facing change?
   
   By default, there is no user-facing change. If `kwargs` is specified, yes.
   
   1. Attribute and item access on a DataFrame (standard Python format syntax):
   
       ```python
       mydf = ps.range(10)
       ps.sql("SELECT {tbl.id}, {tbl[id]} FROM {tbl}", tbl=mydf)
       ```
   
   2. Understanding `DataFrame`:
   
       ```python
       mydf = ps.range(10)
       ps.sql("SELECT * FROM {tbl}", tbl=mydf)
       ```
   
   3. Understanding `Column`:
   
       ```python
       mydf = ps.range(10)
       ps.sql("SELECT {c} FROM {tbl}", c=col("id"), tbl=mydf)
       ```
   
   4. Leveraging other Python string formatting:
   
       ```python
       mydf = spark.range(10)
       spark.sql(
           "SELECT {col} FROM {mydf} WHERE id IN {x}",
           col=mydf.id, mydf=mydf, x=tuple(range(4)))
       ```
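
The examples above rely on a `string.Formatter` subclass that converts each resolved field into SQL text. The following is only a minimal sketch of that idea, not the PR's implementation: `LiteralSQLFormatter` and `_to_sql` are hypothetical names, it handles only numbers and tuples, and it does not touch Spark at all.

```python
import string
from typing import Any, Mapping, Sequence, Tuple


class LiteralSQLFormatter(string.Formatter):
    """Hypothetical formatter that renders plain Python values as SQL text."""

    def get_field(
        self, field_name: str, args: Sequence[Any], kwargs: Mapping[str, Any]
    ) -> Tuple[Any, Any]:
        # Let the standard formatter resolve "{x}", "{x.attr}" and "{x[key]}" first.
        obj, first = super().get_field(field_name, args, kwargs)
        # Then replace the resolved object with its SQL representation.
        return self._to_sql(obj), first

    def _to_sql(self, val: Any) -> str:
        if isinstance(val, (int, float)):
            return str(val)
        if isinstance(val, (list, tuple)):
            # Render sequences as a parenthesized list, e.g. for "IN (...)".
            return "(" + ", ".join(self._to_sql(v) for v in val) + ")"
        # Assumption of this sketch: anything else is rejected.
        raise ValueError(f"Unsupported type: {type(val).__name__}")


query = LiteralSQLFormatter().format(
    "SELECT * FROM t WHERE id IN {x} AND id > {low}", x=tuple(range(4)), low=1
)
print(query)
```

Overriding `get_field` keeps the standard attribute and index lookup syntax intact, which is presumably why the formatter in this PR hooks in at the same place.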
   
   ### How was this patch tested?
   
   Doctests were added.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] viirya commented on a change in pull request #34774: [SPARK-37516][PYTHON][SQL] Uses Python's standard string formatter for SQL API in PySpark

Posted by GitBox <gi...@apache.org>.

viirya commented on a change in pull request #34774:
URL: https://github.com/apache/spark/pull/34774#discussion_r760888396



##########
File path: python/pyspark/pandas/sql_formatter.py
##########
@@ -163,7 +163,7 @@ def sql(
         return sql_processor.sql(query, index_col=index_col, **kwargs)
 
     session = default_session()
-    formatter = SQLStringFormatter(session)
+    formatter = PandasSQLStringFormatter(session)

Review comment:
       Just curious, is this (renaming) a related change?





[GitHub] [spark] AmplabJenkins removed a comment on pull request #34774: [SPARK-37516][PYTHON][SQL] Uses Python's standard string formatter for SQL API in PySpark

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #34774:
URL: https://github.com/apache/spark/pull/34774#issuecomment-984331468


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/145843/
   



[GitHub] [spark] zero323 commented on pull request #34774: [SPARK-37516][PYTHON][SQL] Uses Python's standard string formatter for SQL API in PySpark

Posted by GitBox <gi...@apache.org>.

zero323 commented on pull request #34774:
URL: https://github.com/apache/spark/pull/34774#issuecomment-987958331


   In general LGTM. The only thing ‒ given
   
   > Let me keep it only in API documentation first .. I would like to avoid promoting this support a lot for now .. but keep it unstable and experimental.
   
   shouldn't we keep old `sql` docstring for now?  The new one seems to imply more than we guarantee right now.
   



[GitHub] [spark] SparkQA removed a comment on pull request #34774: [SPARK-37516][PYTHON][SQL] Uses Python's standard string formatter for SQL API in PySpark

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #34774:
URL: https://github.com/apache/spark/pull/34774#issuecomment-985114040


   **[Test build #145867 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145867/testReport)** for PR 34774 at commit [`1392b25`](https://github.com/apache/spark/commit/1392b252362faa88a3a76d38ce260c4e69aa4bd8).



[GitHub] [spark] AmplabJenkins commented on pull request #34774: [SPARK-37516][PYTHON][SQL] Uses Python's standard string formatter for SQL API in PySpark

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #34774:
URL: https://github.com/apache/spark/pull/34774#issuecomment-985138929


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/145867/
   



[GitHub] [spark] HyukjinKwon commented on pull request #34774: [SPARK-37516][PYTHON][SQL] Uses Python's standard string formatter for SQL API in PySpark

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on pull request #34774:
URL: https://github.com/apache/spark/pull/34774#issuecomment-985096653


   > I notice that `kwargs` of both SQLStringFormatter and PandasSQLStringFormatter is of `Mapping[str, Any]`. They seem to accept different types of `kwargs` though. Shall we be more specific about the type, which may also act as documentation? I am also fine with the current typing.
   
   Actually, the annotation on `kwargs` only needs to give the type of the values (see https://www.python.org/dev/peps/pep-0484/#arbitrary-argument-lists-and-default-argument-values). Since a value can be of any type, `Any` looks correct.
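
As a small illustration of the PEP 484 rule referenced here (a sketch only; `collect` is a hypothetical function, not part of PySpark): annotating `**kwargs` with `Any` describes each value, while the keys are always `str`.

```python
from typing import Any, Dict


def collect(query: str, **kwargs: Any) -> Dict[str, Any]:
    # Per PEP 484, the annotation on **kwargs is the type of each value,
    # so a type checker treats kwargs inside the body as Dict[str, Any].
    return kwargs


params = collect("SELECT {x} FROM {tbl}", x=1, tbl="t")
print(params)
```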



[GitHub] [spark] viirya commented on a change in pull request #34774: [SPARK-37516][PYTHON][SQL] Uses Python's standard string formatter for SQL API in PySpark

Posted by GitBox <gi...@apache.org>.

viirya commented on a change in pull request #34774:
URL: https://github.com/apache/spark/pull/34774#discussion_r760889563



##########
File path: python/pyspark/pandas/sql_formatter.py
##########
@@ -163,7 +163,7 @@ def sql(
         return sql_processor.sql(query, index_col=index_col, **kwargs)
 
     session = default_session()
-    formatter = SQLStringFormatter(session)
+    formatter = PandasSQLStringFormatter(session)

Review comment:
       Oh, I see, nvm. I found new `SQLStringFormatter` below.





[GitHub] [spark] AmplabJenkins commented on pull request #34774: [SPARK-37516][PYTHON][SQL] Uses Python's standard string formatter for SQL API in PySpark

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #34774:
URL: https://github.com/apache/spark/pull/34774#issuecomment-985164107


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/50342/
   



[GitHub] [spark] HyukjinKwon commented on a change in pull request #34774: [SPARK-37516][PYTHON][SQL] Uses Python's standard string formatter for SQL API in PySpark

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on a change in pull request #34774:
URL: https://github.com/apache/spark/pull/34774#discussion_r760761625



##########
File path: python/pyspark/pandas/tests/test_sql.py
##########
@@ -26,10 +26,6 @@ def test_error_variable_not_exist(self):
         with self.assertRaisesRegex(KeyError, "variable_foo"):
             ps.sql("select * from {variable_foo}")
 
-    def test_error_unsupported_type(self):

Review comment:
       Removed the duplicate test by mistake (see above)
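
For context on the `test_error_variable_not_exist` test shown in the hunk above: a plain `string.Formatter` already raises `KeyError` for a field that was never supplied, which is the behavior that test asserts for `ps.sql`. The sketch below uses only the standard library and is not taken from the PR.

```python
import string

missing = None
try:
    # No keyword argument named "variable_foo" is passed, so the lookup fails.
    string.Formatter().format("select * from {variable_foo}")
except KeyError as e:
    # The KeyError carries the name of the unresolved field.
    missing = e.args[0]

print(missing)
```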





[GitHub] [spark] SparkQA commented on pull request #34774: [SPARK-37516][PYTHON][SQL] Uses Python's standard string formatter for SQL API in PySpark

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34774:
URL: https://github.com/apache/spark/pull/34774#issuecomment-984324189


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50318/
   



[GitHub] [spark] AmplabJenkins commented on pull request #34774: [SPARK-37516][PYTHON][SQL] Uses Python's standard string formatter for SQL API in PySpark

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #34774:
URL: https://github.com/apache/spark/pull/34774#issuecomment-984360335


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/50318/
   



[GitHub] [spark] SparkQA commented on pull request #34774: [SPARK-37516][PYTHON][SQL] Uses Python's standard string formatter for SQL API in PySpark

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34774:
URL: https://github.com/apache/spark/pull/34774#issuecomment-984320572


   **[Test build #145843 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145843/testReport)** for PR 34774 at commit [`b14db3d`](https://github.com/apache/spark/commit/b14db3d31491cdb85401046371613912b99b84dd).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds the following public classes _(experimental)_:
     * `class PandasSQLStringFormatter(string.Formatter):`
     * `class SQLStringFormatter(string.Formatter):`



[GitHub] [spark] viirya commented on a change in pull request #34774: [SPARK-37516][PYTHON][SQL] Uses Python's standard string formatter for SQL API in PySpark

Posted by GitBox <gi...@apache.org>.

viirya commented on a change in pull request #34774:
URL: https://github.com/apache/spark/pull/34774#discussion_r760887767



##########
File path: python/pyspark/sql/session.py
##########
@@ -915,23 +916,100 @@ def prepare(obj):
         df._schema = struct
         return df
 
-    def sql(self, sqlQuery: str) -> DataFrame:
+    def sql(self, sqlQuery: str, **kwargs: Any) -> DataFrame:
         """Returns a :class:`DataFrame` representing the result of the given query.
+        When ``kwargs`` is specified, this method formats the given string by using the Python
+        standard formatter.
 
         .. versionadded:: 2.0.0
 
+        Parameters
+        ----------
+        sqlQuery : str
+            SQL query string.
+        kwargs : dict
+            Other variables that the user want to set that can be referenced in the query

Review comment:
       nit: s/want/wants/





[GitHub] [spark] HyukjinKwon closed pull request #34774: [SPARK-37516][PYTHON][SQL] Uses Python's standard string formatter for SQL API in PySpark

Posted by GitBox <gi...@apache.org>.

HyukjinKwon closed pull request #34774:
URL: https://github.com/apache/spark/pull/34774


   



[GitHub] [spark] HyukjinKwon commented on a change in pull request #34774: [SPARK-37516][PYTHON][SQL] Uses Python's standard string formatter for SQL API in PySpark

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on a change in pull request #34774:
URL: https://github.com/apache/spark/pull/34774#discussion_r761553931



##########
File path: python/pyspark/sql/sql_formatter.py
##########
@@ -0,0 +1,84 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+import string
+import typing
+from typing import Any, Optional, List, Tuple, Sequence, Mapping
+import uuid
+
+from py4j.java_gateway import is_instance_of
+
+if typing.TYPE_CHECKING:
+    from pyspark.sql import SparkSession, DataFrame
+from pyspark.sql.functions import lit
+
+
+class SQLStringFormatter(string.Formatter):
+    """
+    A standard ``string.Formatter`` in Python that can understand PySpark instances
+    with basic Python objects. This object has to be clear after the use for single SQL
+    query; cannot be reused across multiple SQL queries without cleaning.
+    """
+
+    def __init__(self, session: "SparkSession") -> None:
+        self._session: "SparkSession" = session
+        self._temp_views: List[Tuple[DataFrame, str]] = []
+
+    def get_field(self, field_name: str, args: Sequence[Any], kwargs: Mapping[str, Any]) -> Any:
+        obj, first = super(SQLStringFormatter, self).get_field(field_name, args, kwargs)
+        return self._convert_value(obj, field_name), first
+
+    def _convert_value(self, val: Any, field_name: str) -> Optional[str]:
+        """
+        Converts the given value into a SQL string.
+        """
+        from pyspark import SparkContext
+        from pyspark.sql import Column, DataFrame
+
+        if isinstance(val, Column):
+            assert SparkContext._gateway is not None  # type: ignore[attr-defined]
+
+            gw = SparkContext._gateway  # type: ignore[attr-defined]
+            jexpr = val._jc.expr()
+            if is_instance_of(
+                gw, jexpr, "org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute"
+            ) or is_instance_of(
+                gw, jexpr, "org.apache.spark.sql.catalyst.expressions.AttributeReference"
+            ):
+                return jexpr.sql()

Review comment:
       I was thinking about this case too .. but it won't work for many cases such as:
   
   ```python
   >>> from pyspark.sql import functions
   >>> functions.col("a").alias("b").alias("c")._jc.expr().sql()
   'a AS b AS c'
   ```





[GitHub] [spark] HyukjinKwon commented on pull request #34774: [SPARK-37516][PYTHON][SQL] Uses Python's standard string formatter for SQL API in PySpark

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on pull request #34774:
URL: https://github.com/apache/spark/pull/34774#issuecomment-987430366


   Maybe I will merge it if there are no more comments in a few more days ... but to double check,
   cc @BryanCutler @holdenk @zero323 WDYT?



[GitHub] [spark] HyukjinKwon commented on pull request #34774: [SPARK-37516][PYTHON][SQL] Uses Python's standard string formatter for SQL API in PySpark

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on pull request #34774:
URL: https://github.com/apache/spark/pull/34774#issuecomment-984292523


   cc @xinrong-databricks too



[GitHub] [spark] AmplabJenkins commented on pull request #34774: [SPARK-37516][PYTHON][SQL] Uses Python's standard string formatter for SQL API in PySpark

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #34774:
URL: https://github.com/apache/spark/pull/34774#issuecomment-984331468


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/145843/
   



[GitHub] [spark] SparkQA commented on pull request #34774: [SPARK-37516][PYTHON][SQL] Uses Python's standard string formatter for SQL API in PySpark

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34774:
URL: https://github.com/apache/spark/pull/34774#issuecomment-984307383


   **[Test build #145843 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145843/testReport)** for PR 34774 at commit [`b14db3d`](https://github.com/apache/spark/commit/b14db3d31491cdb85401046371613912b99b84dd).



[GitHub] [spark] HyukjinKwon edited a comment on pull request #34774: [SPARK-37516][PYTHON][SQL] Uses Python's standard string formatter for SQL API in PySpark

Posted by GitBox <gi...@apache.org>.

HyukjinKwon edited a comment on pull request #34774:
URL: https://github.com/apache/spark/pull/34774#issuecomment-985099401


   > Is there any place in PySpark document we should mention this?
   
   Let me keep it only in API documentation first .. I would like to avoid promoting this support a lot for now .. but keep it unstable and experimental.



[GitHub] [spark] SparkQA commented on pull request #34774: [SPARK-37516][PYTHON][SQL] Uses Python's standard string formatter for SQL API in PySpark

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34774:
URL: https://github.com/apache/spark/pull/34774#issuecomment-985132327


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50342/
   



[GitHub] [spark] HyukjinKwon commented on pull request #34774: [SPARK-37516][PYTHON][SQL] Uses Python's standard string formatter for SQL API in PySpark

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on pull request #34774:
URL: https://github.com/apache/spark/pull/34774#issuecomment-988498834


   Merged to master.



[GitHub] [spark] HyukjinKwon commented on pull request #34774: [SPARK-37516][PYTHON][SQL] Uses Python's standard string formatter for SQL API in PySpark

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on pull request #34774:
URL: https://github.com/apache/spark/pull/34774#issuecomment-988362830


   > shouldn't we keep old `sql` docstring for now? The new one seems to imply more than we guarantee right now.
   
   Let me keep the examples in API documentation for now though. I just would like to avoid documenting this publicly like pandas UDFs .. at least this is the same approach I took for `DataFrame.mapInArrow` ..



[GitHub] [spark] SparkQA removed a comment on pull request #34774: [SPARK-37516][PYTHON][SQL] Uses Python's standard string formatter for SQL API in PySpark

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #34774:
URL: https://github.com/apache/spark/pull/34774#issuecomment-984307383


   **[Test build #145843 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145843/testReport)** for PR 34774 at commit [`b14db3d`](https://github.com/apache/spark/commit/b14db3d31491cdb85401046371613912b99b84dd).



[GitHub] [spark] AmplabJenkins removed a comment on pull request #34774: [SPARK-37516][PYTHON][SQL] Uses Python's standard string formatter for SQL API in PySpark

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #34774:
URL: https://github.com/apache/spark/pull/34774#issuecomment-984360335


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/50318/
   



[GitHub] [spark] viirya commented on a change in pull request #34774: [SPARK-37516][PYTHON][SQL] Uses Python's standard string formatter for SQL API in PySpark

Posted by GitBox <gi...@apache.org>.

viirya commented on a change in pull request #34774:
URL: https://github.com/apache/spark/pull/34774#discussion_r761573368



##########
File path: python/pyspark/sql/sql_formatter.py
##########
@@ -0,0 +1,84 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+import string
+import typing
+from typing import Any, Optional, List, Tuple, Sequence, Mapping
+import uuid
+
+from py4j.java_gateway import is_instance_of
+
+if typing.TYPE_CHECKING:
+    from pyspark.sql import SparkSession, DataFrame
+from pyspark.sql.functions import lit
+
+
+class SQLStringFormatter(string.Formatter):
+    """
+    A standard ``string.Formatter`` in Python that can understand PySpark instances
+    with basic Python objects. This object has to be clear after the use for single SQL
+    query; cannot be reused across multiple SQL queries without cleaning.
+    """
+
+    def __init__(self, session: "SparkSession") -> None:
+        self._session: "SparkSession" = session
+        self._temp_views: List[Tuple[DataFrame, str]] = []
+
+    def get_field(self, field_name: str, args: Sequence[Any], kwargs: Mapping[str, Any]) -> Any:
+        obj, first = super(SQLStringFormatter, self).get_field(field_name, args, kwargs)
+        return self._convert_value(obj, field_name), first
+
+    def _convert_value(self, val: Any, field_name: str) -> Optional[str]:
+        """
+        Converts the given value into a SQL string.
+        """
+        from pyspark import SparkContext
+        from pyspark.sql import Column, DataFrame
+
+        if isinstance(val, Column):
+            assert SparkContext._gateway is not None  # type: ignore[attr-defined]
+
+            gw = SparkContext._gateway  # type: ignore[attr-defined]
+            jexpr = val._jc.expr()
+            if is_instance_of(
+                gw, jexpr, "org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute"
+            ) or is_instance_of(
+                gw, jexpr, "org.apache.spark.sql.catalyst.expressions.AttributeReference"
+            ):
+                return jexpr.sql()

Review comment:
       I see. As this is unstable and experimental right now, it seems okay. We can probably address this later.





[GitHub] [spark] SparkQA commented on pull request #34774: [SPARK-37516][PYTHON][SQL] Uses Python's standard string formatter for SQL API in PySpark

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34774:
URL: https://github.com/apache/spark/pull/34774#issuecomment-985150815


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50342/
   



[GitHub] [spark] AmplabJenkins removed a comment on pull request #34774: [SPARK-37516][PYTHON][SQL] Uses Python's standard string formatter for SQL API in PySpark

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #34774:
URL: https://github.com/apache/spark/pull/34774#issuecomment-985164107


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/50342/
   



[GitHub] [spark] SparkQA commented on pull request #34774: [SPARK-37516][PYTHON][SQL] Uses Python's standard string formatter for SQL API in PySpark

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34774:
URL: https://github.com/apache/spark/pull/34774#issuecomment-985114040


   **[Test build #145867 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145867/testReport)** for PR 34774 at commit [`1392b25`](https://github.com/apache/spark/commit/1392b252362faa88a3a76d38ce260c4e69aa4bd8).



[GitHub] [spark] AmplabJenkins removed a comment on pull request #34774: [SPARK-37516][PYTHON][SQL] Uses Python's standard string formatter for SQL API in PySpark

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #34774:
URL: https://github.com/apache/spark/pull/34774#issuecomment-985138929


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/145867/
   



[GitHub] [spark] SparkQA commented on pull request #34774: [SPARK-37516][PYTHON][SQL] Uses Python's standard string formatter for SQL API in PySpark

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34774:
URL: https://github.com/apache/spark/pull/34774#issuecomment-984351389


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50318/
   



[GitHub] [spark] viirya commented on a change in pull request #34774: [SPARK-37516][PYTHON][SQL] Uses Python's standard string formatter for SQL API in PySpark

Posted by GitBox <gi...@apache.org>.

viirya commented on a change in pull request #34774:
URL: https://github.com/apache/spark/pull/34774#discussion_r760897993



##########
File path: python/pyspark/sql/sql_formatter.py
##########

Review comment:
       Just wondering: does `sql()` work for all expressions, so that its output can be put directly into the SQL string?





[GitHub] [spark] xinrong-databricks commented on pull request #34774: [SPARK-37516][PYTHON][SQL] Uses Python's standard string formatter for SQL API in PySpark

Posted by GitBox <gi...@apache.org>.

xinrong-databricks commented on pull request #34774:
URL: https://github.com/apache/spark/pull/34774#issuecomment-984416456


   Looks great! I notice that `kwargs` in both `SQLStringFormatter` and `PandasSQLStringFormatter` is typed as `Mapping[str, Any]`, although the two seem to accept different types of `kwargs`. Shall we make the type more specific, which could also serve as documentation? I am also fine with the current typing.
    



[GitHub] [spark] HyukjinKwon commented on a change in pull request #34774: [SPARK-37516][PYTHON][SQL] Uses Python's standard string formatter for SQL API in PySpark

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on a change in pull request #34774:
URL: https://github.com/apache/spark/pull/34774#discussion_r761553700



##########
File path: python/pyspark/sql/sql_formatter.py
##########

Review comment:
       I was thinking about this approach too, but it seems it won't work in many cases, such as:
   
   ```python
   >>> from pyspark.sql import functions
   >>> functions.col("a").alias("b").alias("c")._jc.expr().sql()
   'a AS b AS c'
   ```
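
The concern above can be illustrated without a running Spark session. What follows is a minimal, stdlib-only sketch of the general technique, not the code in this PR: a `string.Formatter` subclass overrides `get_field` and renders only values it knows are safe to inline. `Attr`, `Alias` and `WhitelistFormatter` are hypothetical stand-ins invented for this illustration; they are not PySpark classes, and the real implementation checks Catalyst expression types through Py4J instead.

```python
import string
from typing import Any, Mapping, Sequence, Tuple


class Attr:
    """Hypothetical stand-in for a plain column reference such as col("a")."""

    def __init__(self, name: str) -> None:
        self.name = name

    def sql(self) -> str:
        return self.name


class Alias:
    """Hypothetical stand-in for an aliased expression such as col("a").alias("b")."""

    def __init__(self, child: Any, name: str) -> None:
        self.child = child
        self.name = name

    def sql(self) -> str:
        # Naive rendering: nested aliases produce "a AS b AS c",
        # which cannot be dropped into a SELECT list as-is.
        return f"{self.child.sql()} AS {self.name}"


class WhitelistFormatter(string.Formatter):
    """Inlines plain attribute references only; rejects aliased expressions."""

    def get_field(
        self, field_name: str, args: Sequence[Any], kwargs: Mapping[str, Any]
    ) -> Tuple[Any, Any]:
        # Let the standard formatter resolve "{name}", "{name.attr}", "{name[key]}" first.
        obj, first = super().get_field(field_name, args, kwargs)
        return self._convert(obj, field_name), first

    def _convert(self, val: Any, field_name: str) -> Any:
        if isinstance(val, Attr):
            return val.sql()
        if isinstance(val, Alias):
            raise ValueError(f"{field_name}: only plain column references are supported")
        return val  # everything else falls through to normal string formatting


fmt = WhitelistFormatter()
query = fmt.format("SELECT {c} FROM {tbl}", c=Attr("a"), tbl="t")
nested = Alias(Alias(Attr("a"), "b"), "c").sql()
```

Here `query` is a valid statement because only a plain reference was inlined, while `nested` reproduces the problematic `a AS b AS c` shape; passing an `Alias` to `fmt.format` raises `ValueError` rather than emitting it.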





[GitHub] [spark] HyukjinKwon commented on pull request #34774: [SPARK-37516][PYTHON][SQL] Uses Python's standard string formatter for SQL API in PySpark

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on pull request #34774:
URL: https://github.com/apache/spark/pull/34774#issuecomment-985099401


   > Is there any place in PySpark document we should mention this?

   Let me keep it only in the API documentation for now. I would like to avoid promoting this support too much at this point, and keep it unstable and experimental.



[GitHub] [spark] SparkQA commented on pull request #34774: [SPARK-37516][PYTHON][SQL] Uses Python's standard string formatter for SQL API in PySpark

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34774:
URL: https://github.com/apache/spark/pull/34774#issuecomment-985128304


   **[Test build #145867 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145867/testReport)** for PR 34774 at commit [`1392b25`](https://github.com/apache/spark/commit/1392b252362faa88a3a76d38ce260c4e69aa4bd8).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds the following public classes _(experimental)_:
     * `class PandasSQLStringFormatter(string.Formatter):`
     * `class SQLStringFormatter(string.Formatter):`

