Posted to commits@spark.apache.org by gu...@apache.org on 2020/06/03 03:10:31 UTC

[spark] branch branch-3.0 updated: [SPARK-31895][PYTHON][SQL] Support DataFrame.explain(extended: str) case to be consistent with Scala side

This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-3.0
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.0 by this push:
     new 8cea968  [SPARK-31895][PYTHON][SQL] Support DataFrame.explain(extended: str) case to be consistent with Scala side
8cea968 is described below

commit 8cea968596baec6c71fe2c2bd4f5469b33b583e9
Author: HyukjinKwon <gu...@apache.org>
AuthorDate: Wed Jun 3 12:07:05 2020 +0900

    [SPARK-31895][PYTHON][SQL] Support DataFrame.explain(extended: str) case to be consistent with Scala side
    
    ### What changes were proposed in this pull request?
    
    Scala:
    
    ```scala
    scala> spark.range(10).explain("cost")
    ```
    ```
    == Optimized Logical Plan ==
    Range (0, 10, step=1, splits=Some(12)), Statistics(sizeInBytes=80.0 B)
    
    == Physical Plan ==
    *(1) Range (0, 10, step=1, splits=12)
    ```
    
    PySpark:
    
    ```python
    >>> spark.range(10).explain("cost")
    ```
    ```
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/.../spark/python/pyspark/sql/dataframe.py", line 333, in explain
        raise TypeError(err_msg)
    TypeError: extended (optional) should be provided as bool, got <class 'str'>
    ```
    
    In addition, this is consistent with other APIs; for example, `DataFrame.sample` also accepts both `DataFrame.sample(1.0)` and `DataFrame.sample(False)`.
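
    As a rough standalone sketch of the dispatch this enables (illustrative only; the actual logic lives in `DataFrame.explain` in the diff below, which checks `basestring` for Python 2 compatibility), the single positional argument is classified by type:

    ```python
    # Illustrative sketch only, not the actual PySpark code: map the
    # (extended, mode) arguments of explain() to an explain mode string.
    def resolve_explain_mode(extended=None, mode=None):
        if extended is not None and mode is not None:
            raise Exception("extended and mode should not be set together.")
        if extended is None and mode is None:              # explain()
            return "simple"
        if isinstance(extended, bool) and mode is None:    # explain(True) / explain(extended=False)
            return "extended" if extended else "simple"
        if isinstance(extended, str) and mode is None:     # explain("cost")
            return extended
        if extended is None and isinstance(mode, str):     # explain(mode="cost")
            return mode
        argtypes = [str(type(arg)) for arg in [extended, mode] if arg is not None]
        raise TypeError(
            "extended (optional) and mode (optional) should be a string "
            "and bool; however, got [%s]." % ", ".join(argtypes))

    assert resolve_explain_mode("cost") == "cost"
    assert resolve_explain_mode(mode="formatted") == "formatted"
    assert resolve_explain_mode(True) == "extended"
    ```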
    
    ### Why are the changes needed?
    
    To provide consistent API support between the Scala and Python sides.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No, it only changes unreleased branches.
    If this lands only on master, then yes: users will be able to set `mode` via `df.explain("...")` in Spark 3.1.
    
    After this PR:
    
    ```python
    >>> spark.range(10).explain("cost")
    ```
    ```
    == Optimized Logical Plan ==
    Range (0, 10, step=1, splits=Some(12)), Statistics(sizeInBytes=80.0 B)
    
    == Physical Plan ==
    *(1) Range (0, 10, step=1, splits=12)
    ```
    
    ### How was this patch tested?
    
    A unit test was added, and the following calls were also run manually to make sure they behave as expected:
    
    ```python
    spark.range(10).explain(True)
    spark.range(10).explain(False)
    spark.range(10).explain("cost")
    spark.range(10).explain(extended="cost")
    spark.range(10).explain(mode="cost")
    spark.range(10).explain()
    spark.range(10).explain(True, "cost")
    spark.range(10).explain(1.0)
    ```
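
    With this change, the invalid calls in the list above are expected to fail rather than be silently accepted. An illustrative session (assumed output, reconstructed from the error messages in the diff; under Python 2 the types would render as `<type ...>`):

    ```python
    >>> spark.range(10).explain(1.0)
    Traceback (most recent call last):
      ...
    TypeError: extended (optional) and mode (optional) should be a string and bool; however, got [<class 'float'>].
    >>> spark.range(10).explain(True, "cost")
    Traceback (most recent call last):
      ...
    Exception: extended and mode should not be set together.
    ```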
    
    Closes #28711 from HyukjinKwon/SPARK-31895.
    
    Authored-by: HyukjinKwon <gu...@apache.org>
    Signed-off-by: HyukjinKwon <gu...@apache.org>
    (cherry picked from commit e1d52011401c1989f26b230eb8c82adc63e147e7)
    Signed-off-by: HyukjinKwon <gu...@apache.org>
---
 python/pyspark/sql/dataframe.py | 35 ++++++++++++++++++++++++-----------
 1 file changed, 24 insertions(+), 11 deletions(-)

diff --git a/python/pyspark/sql/dataframe.py b/python/pyspark/sql/dataframe.py
index 6f4fdd3..8ba2ffa 100644
--- a/python/pyspark/sql/dataframe.py
+++ b/python/pyspark/sql/dataframe.py
@@ -276,6 +276,8 @@ class DataFrame(PandasMapOpsMixin, PandasConversionMixin):
         """Prints the (logical and physical) plans to the console for debugging purpose.
 
         :param extended: boolean, default ``False``. If ``False``, prints only the physical plan.
+            When this is a string without specifying the ``mode``, it works as the mode is
+            specified.
         :param mode: specifies the expected output format of plans.
 
             * ``simple``: Print only a physical plan.
@@ -306,12 +308,17 @@ class DataFrame(PandasMapOpsMixin, PandasConversionMixin):
         Output [2]: [age#0, name#1]
         ...
 
+        >>> df.explain("cost")
+        == Optimized Logical Plan ==
+        ...Statistics...
+        ...
+
         .. versionchanged:: 3.0.0
            Added optional argument `mode` to specify the expected output format of plans.
         """
 
         if extended is not None and mode is not None:
-            raise Exception("extended and mode can not be specified simultaneously")
+            raise Exception("extended and mode should not be set together.")
 
         # For the no argument case: df.explain()
         is_no_argument = extended is None and mode is None
@@ -319,18 +326,22 @@ class DataFrame(PandasMapOpsMixin, PandasConversionMixin):
         # For the cases below:
         #   explain(True)
         #   explain(extended=False)
-        is_extended_case = extended is not None and isinstance(extended, bool)
+        is_extended_case = isinstance(extended, bool) and mode is None
 
-        # For the mode specified: df.explain(mode="formatted")
-        is_mode_case = mode is not None and isinstance(mode, basestring)
+        # For the case when extended is mode:
+        #   df.explain("formatted")
+        is_extended_as_mode = isinstance(extended, basestring) and mode is None
 
-        if not is_no_argument and not (is_extended_case or is_mode_case):
-            if extended is not None:
-                err_msg = "extended (optional) should be provided as bool" \
-                    ", got {0}".format(type(extended))
-            else:  # For mode case
-                err_msg = "mode (optional) should be provided as str, got {0}".format(type(mode))
-            raise TypeError(err_msg)
+        # For the mode specified:
+        #   df.explain(mode="formatted")
+        is_mode_case = extended is None and isinstance(mode, basestring)
+
+        if not (is_no_argument or is_extended_case or is_extended_as_mode or is_mode_case):
+            argtypes = [
+                str(type(arg)) for arg in [extended, mode] if arg is not None]
+            raise TypeError(
+                "extended (optional) and mode (optional) should be a string "
+                "and bool; however, got [%s]." % ", ".join(argtypes))
 
         # Sets an explain mode depending on a given argument
         if is_no_argument:
@@ -339,6 +350,8 @@ class DataFrame(PandasMapOpsMixin, PandasConversionMixin):
             explain_mode = "extended" if extended else "simple"
         elif is_mode_case:
             explain_mode = mode
+        elif is_extended_as_mode:
+            explain_mode = extended
 
         print(self._sc._jvm.PythonSQLUtils.explainString(self._jdf.queryExecution(), explain_mode))
 


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@spark.apache.org
For additional commands, e-mail: commits-help@spark.apache.org