Posted to commits@spark.apache.org by gu...@apache.org on 2023/06/19 00:32:02 UTC
[spark] branch master updated: [SPARK-43009][PYTHON][FOLLOWUP] Parameterized `sql_formatter.sql()` with Any constants
This is an automated email from the ASF dual-hosted git repository.
gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/master by this push:
new 53dae3d0440 [SPARK-43009][PYTHON][FOLLOWUP] Parameterized `sql_formatter.sql()` with Any constants
53dae3d0440 is described below
commit 53dae3d0440f5acad1fd30b17fe27ed208860960
Author: Max Gekk <ma...@gmail.com>
AuthorDate: Mon Jun 19 09:31:50 2023 +0900
[SPARK-43009][PYTHON][FOLLOWUP] Parameterized `sql_formatter.sql()` with Any constants
### What changes were proposed in this pull request?
In the PR, I propose to change the API of parameterized SQL and replace the type of argument values from `string` to `Any` in `sql_formatter`. The language API can accept `Any` objects from which it is possible to construct literal expressions.
### Why are the changes needed?
To align the API to PySpark's `sql()`.
Also, the current implementation of the parameterized `sql()` requires arguments as string values that are parsed to SQL literal expressions, which causes the following issues:
1. SQL comments are skipped while parsing, so some fragments of the input might be dropped. For example, in `'Europe -- Amsterdam'`, the `-- Amsterdam` part is excluded from the input.
2. Special characters in string values must be escaped, for instance `'E\'Twaun Moore'`.
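The two issues above follow from pre-formatting values as SQL text on the caller's side. A minimal sketch of the idea behind the fix (a hypothetical helper, not Spark's actual implementation): when typed Python objects are converted to literals by the library, quoting and escaping are handled in one place, and comment markers inside strings cannot truncate the value.

```python
import datetime

def to_sql_literal(value):
    """Render a Python value as a SQL literal (illustrative only)."""
    if isinstance(value, bool):  # check bool before int: bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, int):
        return str(value)
    if isinstance(value, str):
        # Escaping happens here, so callers never pre-quote strings and
        # "--" inside a value is just text, not the start of a SQL comment.
        return "'" + value.replace("'", "''") + "'"
    if isinstance(value, datetime.date):
        return f"DATE'{value.isoformat()}'"
    raise TypeError(f"unsupported type: {type(value).__name__}")

print(to_sql_literal("E'Twaun Moore"))            # 'E''Twaun Moore'
print(to_sql_literal("Europe -- Amsterdam"))      # 'Europe -- Amsterdam'
print(to_sql_literal(datetime.date(2023, 3, 21))) # DATE'2023-03-21'
```

With string-typed arguments, the caller had to write `"'E\\'Twaun Moore'"` and hope the parser kept the whole value; with typed arguments, plain `"E'Twaun Moore"` round-trips safely.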
### Does this PR introduce _any_ user-facing change?
Yes.
### How was this patch tested?
By running the affected test suite:
```
$ python/run-tests --parallelism=1 --testnames 'pyspark.pandas.sql_formatter'
```
Closes #41644 from MaxGekk/fix-pandas-sql_formatter.
Authored-by: Max Gekk <ma...@gmail.com>
Signed-off-by: Hyukjin Kwon <gu...@apache.org>
---
python/pyspark/pandas/sql_formatter.py | 16 ++++++++++------
1 file changed, 10 insertions(+), 6 deletions(-)
diff --git a/python/pyspark/pandas/sql_formatter.py b/python/pyspark/pandas/sql_formatter.py
index f87dd3ff29f..4387a1e0909 100644
--- a/python/pyspark/pandas/sql_formatter.py
+++ b/python/pyspark/pandas/sql_formatter.py
@@ -43,7 +43,7 @@ _CAPTURE_SCOPES = 3
def sql(
query: str,
index_col: Optional[Union[str, List[str]]] = None,
- args: Dict[str, str] = {},
+ args: Optional[Dict[str, Any]] = None,
**kwargs: Any,
) -> DataFrame:
"""
@@ -103,10 +103,14 @@ def sql(
Also note that the index name(s) should be matched to the existing name.
args : dict
- A dictionary of parameter names to string values that are parsed as SQL literal
- expressions. For example, dict keys: "rank", "name", "birthdate"; dict values:
- "1", "'Steven'", "DATE'2023-03-21'". The fragments of string values belonged to SQL
- comments are skipped while parsing.
+ A dictionary of parameter names to Python objects that can be converted to
+ SQL literal expressions. See
+ <a href="https://spark.apache.org/docs/latest/sql-ref-datatypes.html">
+ Supported Data Types</a> for supported value types in Python.
+ For example, dictionary keys: "rank", "name", "birthdate";
+ dictionary values: 1, "Steven", datetime.date(2023, 4, 2).
+ Dict value can be also a `Column` of literal expression, in that case it is taken as is.
+
.. versionadded:: 3.4.0
@@ -166,7 +170,7 @@ def sql(
And substitute named parameters with the `:` prefix by SQL literals.
- >>> ps.sql("SELECT * FROM range(10) WHERE id > :bound1", args={"bound1":"7"})
+ >>> ps.sql("SELECT * FROM range(10) WHERE id > :bound1", args={"bound1":7})
id
0 8
1 9
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@spark.apache.org
For additional commands, e-mail: commits-help@spark.apache.org