Posted to commits@spark.apache.org by gu...@apache.org on 2020/06/03 03:10:31 UTC
[spark] branch branch-3.0 updated: [SPARK-31895][PYTHON][SQL] Support DataFrame.explain(extended: str) case to be consistent with Scala side
This is an automated email from the ASF dual-hosted git repository.
gurwls223 pushed a commit to branch branch-3.0
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/branch-3.0 by this push:
new 8cea968 [SPARK-31895][PYTHON][SQL] Support DataFrame.explain(extended: str) case to be consistent with Scala side
8cea968 is described below
commit 8cea968596baec6c71fe2c2bd4f5469b33b583e9
Author: HyukjinKwon <gu...@apache.org>
AuthorDate: Wed Jun 3 12:07:05 2020 +0900
[SPARK-31895][PYTHON][SQL] Support DataFrame.explain(extended: str) case to be consistent with Scala side
### What changes were proposed in this pull request?
Scala:
```scala
scala> spark.range(10).explain("cost")
```
```
== Optimized Logical Plan ==
Range (0, 10, step=1, splits=Some(12)), Statistics(sizeInBytes=80.0 B)
== Physical Plan ==
*(1) Range (0, 10, step=1, splits=12)
```
PySpark:
```python
>>> spark.range(10).explain("cost")
```
```
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/.../spark/python/pyspark/sql/dataframe.py", line 333, in explain
raise TypeError(err_msg)
TypeError: extended (optional) should be provided as bool, got <class 'str'>
```
In addition, this is consistent with other APIs too; for example, `DataFrame.sample` also supports both `DataFrame.sample(1.0)` and `DataFrame.sample(False)`.
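The `sample`-style pattern mentioned above, a single positional argument whose meaning depends on its type, can be sketched in plain Python. This is an illustrative standalone helper (`parse_sample_arg` is a hypothetical name, not Spark's actual implementation):

```python
def parse_sample_arg(arg):
    """Type-based dispatch: a bool selects withReplacement,
    a float selects fraction, anything else is rejected."""
    if isinstance(arg, bool):
        return {"withReplacement": arg, "fraction": None}
    if isinstance(arg, float):
        return {"withReplacement": None, "fraction": arg}
    raise TypeError("expected bool or float, got %s" % type(arg))

print(parse_sample_arg(1.0))
print(parse_sample_arg(False))
```

Note the `bool` check comes first: `bool` is a subclass of `int` in Python, so type checks around numeric arguments should handle `bool` explicitly before any broader numeric check.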
### Why are the changes needed?
To provide consistent API support across the language APIs.
### Does this PR introduce _any_ user-facing change?
No, these are only changes in unreleased branches.
If this lands on master only, then yes: users will be able to set the output format via `df.explain("...")` in Spark 3.1.
After this PR:
```python
>>> spark.range(10).explain("cost")
```
```
== Optimized Logical Plan ==
Range (0, 10, step=1, splits=Some(12)), Statistics(sizeInBytes=80.0 B)
== Physical Plan ==
*(1) Range (0, 10, step=1, splits=12)
```
### How was this patch tested?
A unit test was added, and the change was also manually tested to make sure the following calls behave as expected:
```python
spark.range(10).explain(True)
spark.range(10).explain(False)
spark.range(10).explain("cost")
spark.range(10).explain(extended="cost")
spark.range(10).explain(mode="cost")
spark.range(10).explain()
spark.range(10).explain(True, "cost")  # raises: extended and mode set together
spark.range(10).explain(1.0)           # raises TypeError: unsupported type
```
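The argument-dispatch logic this patch adds can be illustrated with a Spark-free sketch. `resolve_explain_mode` is a hypothetical helper name mirroring the branches in the patch, with Python 3's `str` standing in for `basestring`:

```python
def resolve_explain_mode(extended=None, mode=None):
    """Map DataFrame.explain-style arguments to an explain-mode string,
    mirroring the is_*_case branches added by this patch."""
    if extended is not None and mode is not None:
        raise Exception("extended and mode should not be set together.")
    if extended is None and mode is None:            # df.explain()
        return "simple"
    if isinstance(extended, bool) and mode is None:  # df.explain(True)
        return "extended" if extended else "simple"
    if isinstance(extended, str) and mode is None:   # df.explain("cost")
        return extended
    if extended is None and isinstance(mode, str):   # df.explain(mode="cost")
        return mode
    # Anything else, e.g. explain(1.0), is a type error.
    argtypes = [str(type(arg)) for arg in [extended, mode] if arg is not None]
    raise TypeError(
        "extended (optional) and mode (optional) should be a string "
        "and bool; however, got [%s]." % ", ".join(argtypes))

print(resolve_explain_mode("cost"))
print(resolve_explain_mode(True))
```

The bool branch is tested before the string branch, matching the patch: a `bool` value keeps the pre-3.0 `extended` semantics, and only a string falls through to be treated as a mode.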
Closes #28711 from HyukjinKwon/SPARK-31895.
Authored-by: HyukjinKwon <gu...@apache.org>
Signed-off-by: HyukjinKwon <gu...@apache.org>
(cherry picked from commit e1d52011401c1989f26b230eb8c82adc63e147e7)
Signed-off-by: HyukjinKwon <gu...@apache.org>
---
python/pyspark/sql/dataframe.py | 35 ++++++++++++++++++++++++-----------
1 file changed, 24 insertions(+), 11 deletions(-)
diff --git a/python/pyspark/sql/dataframe.py b/python/pyspark/sql/dataframe.py
index 6f4fdd3..8ba2ffa 100644
--- a/python/pyspark/sql/dataframe.py
+++ b/python/pyspark/sql/dataframe.py
@@ -276,6 +276,8 @@ class DataFrame(PandasMapOpsMixin, PandasConversionMixin):
"""Prints the (logical and physical) plans to the console for debugging purpose.
:param extended: boolean, default ``False``. If ``False``, prints only the physical plan.
+ When this is a string without specifying the ``mode``, it works as the mode is
+ specified.
:param mode: specifies the expected output format of plans.
* ``simple``: Print only a physical plan.
@@ -306,12 +308,17 @@ class DataFrame(PandasMapOpsMixin, PandasConversionMixin):
Output [2]: [age#0, name#1]
...
+ >>> df.explain("cost")
+ == Optimized Logical Plan ==
+ ...Statistics...
+ ...
+
.. versionchanged:: 3.0.0
Added optional argument `mode` to specify the expected output format of plans.
"""
if extended is not None and mode is not None:
- raise Exception("extended and mode can not be specified simultaneously")
+ raise Exception("extended and mode should not be set together.")
# For the no argument case: df.explain()
is_no_argument = extended is None and mode is None
@@ -319,18 +326,22 @@ class DataFrame(PandasMapOpsMixin, PandasConversionMixin):
# For the cases below:
# explain(True)
# explain(extended=False)
- is_extended_case = extended is not None and isinstance(extended, bool)
+ is_extended_case = isinstance(extended, bool) and mode is None
- # For the mode specified: df.explain(mode="formatted")
- is_mode_case = mode is not None and isinstance(mode, basestring)
+ # For the case when extended is mode:
+ # df.explain("formatted")
+ is_extended_as_mode = isinstance(extended, basestring) and mode is None
- if not is_no_argument and not (is_extended_case or is_mode_case):
- if extended is not None:
- err_msg = "extended (optional) should be provided as bool" \
- ", got {0}".format(type(extended))
- else: # For mode case
- err_msg = "mode (optional) should be provided as str, got {0}".format(type(mode))
- raise TypeError(err_msg)
+ # For the mode specified:
+ # df.explain(mode="formatted")
+ is_mode_case = extended is None and isinstance(mode, basestring)
+
+ if not (is_no_argument or is_extended_case or is_extended_as_mode or is_mode_case):
+ argtypes = [
+ str(type(arg)) for arg in [extended, mode] if arg is not None]
+ raise TypeError(
+ "extended (optional) and mode (optional) should be a string "
+ "and bool; however, got [%s]." % ", ".join(argtypes))
# Sets an explain mode depending on a given argument
if is_no_argument:
@@ -339,6 +350,8 @@ class DataFrame(PandasMapOpsMixin, PandasConversionMixin):
explain_mode = "extended" if extended else "simple"
elif is_mode_case:
explain_mode = mode
+ elif is_extended_as_mode:
+ explain_mode = extended
print(self._sc._jvm.PythonSQLUtils.explainString(self._jdf.queryExecution(), explain_mode))