Posted to reviews@spark.apache.org by "ueshin (via GitHub)" <gi...@apache.org> on 2023/04/28 02:53:46 UTC

[GitHub] [spark] ueshin opened a new pull request, #40988: [SPARK-41971][SQL][PYTHON] Add a config for pandas conversion how to handle struct types

ueshin opened a new pull request, #40988:
URL: https://github.com/apache/spark/pull/40988

   ### What changes were proposed in this pull request?
   
   Adds a config that controls how struct types are handled when converting to pandas DataFrames.
   
   - `spark.sql.execution.pandas.structHandlingMode` (default: `"legacy"`)
   
   It specifies the conversion mode for struct types when creating a pandas DataFrame.
   
   #### When `"legacy"`, the behavior is the same as before, except that with Arrow and Spark Connect will raise a more readable exception when there are duplicated nested field names.
   
   ```py
   >>> spark.sql("values (1, struct(1 as a, 2 as a)) as t(x, y)").toPandas()
   Traceback (most recent call last):
   ...
   pyspark.errors.exceptions.connect.UnsupportedOperationException: [DUPLICATED_FIELD_NAME_IN_ARROW_STRUCT] Duplicated field names in Arrow Struct are not allowed, got [a, a].
   ```
   
   #### When `"row"`, convert to Row object regardless of Arrow optimization.
   
   ```py
   >>> spark.conf.set('spark.sql.execution.pandas.structHandlingMode', 'row')
   >>> spark.conf.set('spark.sql.execution.arrow.pyspark.enabled', False)
   >>> spark.sql("values (1, struct(1 as a, 2 as b)) as t(x, y)").toPandas()
      x       y
   0  1  (1, 2)
   >>> spark.sql("values (1, struct(1 as a, 2 as a)) as t(x, y)").toPandas()
      x       y
   0  1  (1, 2)
   >>> spark.conf.set('spark.sql.execution.arrow.pyspark.enabled', True)
   >>> spark.sql("values (1, struct(1 as a, 2 as b)) as t(x, y)").toPandas()
      x       y
   0  1  (1, 2)
   >>> spark.sql("values (1, struct(1 as a, 2 as a)) as t(x, y)").toPandas()
      x       y
   0  1  (1, 2)
   ```
   
   #### When `"dict"`, convert to dict and use suffixed key names, e.g., `a_0`, `a_1`, if there are duplicated nested field names, regardless of Arrow optimization.
   
   ```py
   >>> spark.conf.set('spark.sql.execution.pandas.structHandlingMode', 'dict')
   >>> spark.conf.set('spark.sql.execution.arrow.pyspark.enabled', False)
   >>> spark.sql("values (1, struct(1 as a, 2 as b)) as t(x, y)").toPandas()
      x                 y
   0  1  {'a': 1, 'b': 2}
   >>> spark.sql("values (1, struct(1 as a, 2 as a)) as t(x, y)").toPandas()
      x                     y
   0  1  {'a_0': 1, 'a_1': 2}
   >>> spark.conf.set('spark.sql.execution.arrow.pyspark.enabled', True)
   >>> spark.sql("values (1, struct(1 as a, 2 as b)) as t(x, y)").toPandas()
      x                 y
   0  1  {'a': 1, 'b': 2}
   >>> spark.sql("values (1, struct(1 as a, 2 as a)) as t(x, y)").toPandas()
      x                     y
   0  1  {'a_0': 1, 'a_1': 2}
   ```
   
   ### Why are the changes needed?
   
   Currently, `df.toPandas()` exhibits three different behaviors with nested struct types:
   
   - vanilla PySpark with Arrow optimization disabled
   
   ```py
   >>> spark.conf.set('spark.sql.execution.arrow.pyspark.enabled', False)
   >>> spark.sql("values (1, struct(1 as a, 2 as b)) as t(x, y)").toPandas()
      x       y
   0  1  (1, 2)
   ```
   
   It uses `Row` objects for struct types.
   
   Duplicated nested field names are allowed:
   
   ```py
   >>> spark.sql("values (1, struct(1 as a, 2 as a)) as t(x, y)").toPandas()
      x       y
   0  1  (1, 2)
   ```
   
   - vanilla PySpark with Arrow optimization enabled
   
   ```py
   >>> spark.conf.set('spark.sql.execution.arrow.pyspark.enabled', True)
   >>> spark.sql("values (1, struct(1 as a, 2 as b)) as t(x, y)").toPandas()
      x                 y
   0  1  {'a': 1, 'b': 2}
   ```
   
   It uses `dict` for struct types.
   
   It raises an exception when there are duplicated nested field names:
   
   ```py
   >>> spark.sql("values (1, struct(1 as a, 2 as a)) as t(x, y)").toPandas()
   Traceback (most recent call last):
   ...
   pyarrow.lib.ArrowInvalid: Ran out of field metadata, likely malformed
   ```
   
   - Spark Connect
   
   ```py
   >>> spark.sql("values (1, struct(1 as a, 2 as b)) as t(x, y)").toPandas()
      x                 y
   0  1  {'a': 1, 'b': 2}
   ```
   
   It uses `dict` for struct types.
   
   If there are duplicated nested field names, the duplicated keys are suffixed:
   
   ```py
   >>> spark.sql("values (1, struct(1 as a, 2 as a)) as t(x, y)").toPandas()
      x                     y
   0  1  {'a_0': 1, 'a_1': 2}
   ```
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes. Users will be able to configure how struct types are converted via the new `spark.sql.execution.pandas.structHandlingMode` config, as shown below.
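   
   For example, a minimal sketch of both ways to set the mode (assumes an existing `spark` session for the runtime form; the builder form applies when creating a session):
   
   ```py
   >>> from pyspark.sql import SparkSession
   >>> # Set at runtime on an existing session, as in the examples above
   >>> spark.conf.set('spark.sql.execution.pandas.structHandlingMode', 'dict')
   >>> # Or set when building the session
   >>> spark = (
   ...     SparkSession.builder
   ...     .config('spark.sql.execution.pandas.structHandlingMode', 'row')
   ...     .getOrCreate()
   ... )
   ```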
   
   ### How was this patch tested?
   
   Modified the related tests.




[GitHub] [spark] xinrong-meng commented on pull request #40988: [SPARK-41971][SQL][PYTHON] Add a config for pandas conversion how to handle struct types

Posted by "xinrong-meng (via GitHub)" <gi...@apache.org>.
xinrong-meng commented on PR #40988:
URL: https://github.com/apache/spark/pull/40988#issuecomment-1530022281

   Shall we add `.. versionchanged::` to `toPandas`?
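   
   For reference, a sketch of what the directive might look like in the `toPandas` docstring (the version number is a placeholder for whichever release this lands in):
   
   ```py
   def toPandas(self) -> "PandasDataFrameLike":
       """
       Returns the contents of this :class:`DataFrame` as Pandas ``pandas.DataFrame``.
   
       .. versionchanged:: 3.5.0
           Conversion of struct types is configurable via
           ``spark.sql.execution.pandas.structHandlingMode``.
       """
   ```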




[GitHub] [spark] ueshin commented on pull request #40988: [SPARK-41971][SQL][PYTHON] Add a config for pandas conversion how to handle struct types

Posted by "ueshin (via GitHub)" <gi...@apache.org>.
ueshin commented on PR #40988:
URL: https://github.com/apache/spark/pull/40988#issuecomment-1528338027

   I'll move the conversion function to `pyspark/sql/pandas/types.py`.




[GitHub] [spark] xinrong-meng commented on a diff in pull request #40988: [SPARK-41971][SQL][PYTHON] Add a config for pandas conversion how to handle struct types

Posted by "xinrong-meng (via GitHub)" <gi...@apache.org>.
xinrong-meng commented on code in PR #40988:
URL: https://github.com/apache/spark/pull/40988#discussion_r1181899926


##########
python/pyspark/sql/pandas/types.py:
##########
@@ -462,3 +467,233 @@ def _convert_dict_to_map_items(s: "PandasSeriesLike") -> "PandasSeriesLike":
     :return: pandas.Series of lists of (key, value) pairs
     """
     return cast("PandasSeriesLike", s.apply(lambda d: list(d.items()) if d is not None else None))
+
+
+def _to_corrected_pandas_type(dt: DataType) -> Optional[Any]:
+    """
+    When converting Spark SQL records to Pandas `pandas.DataFrame`, the inferred data type
+    may be wrong. This method gets the corrected data type for Pandas if that type may be
+    inferred incorrectly.
+    """
+    import numpy as np
+
+    if type(dt) == ByteType:
+        return np.int8
+    elif type(dt) == ShortType:
+        return np.int16
+    elif type(dt) == IntegerType:
+        return np.int32
+    elif type(dt) == LongType:
+        return np.int64
+    elif type(dt) == FloatType:
+        return np.float32
+    elif type(dt) == DoubleType:
+        return np.float64
+    elif type(dt) == BooleanType:
+        return bool
+    elif type(dt) == TimestampType:
+        return np.dtype("datetime64[ns]")
+    elif type(dt) == TimestampNTZType:
+        return np.dtype("datetime64[ns]")
+    elif type(dt) == DayTimeIntervalType:
+        return np.dtype("timedelta64[ns]")
+    else:
+        return None
+
+
+def _create_converter_to_pandas(

Review Comment:
   I am wondering if PySpark/Pandas data type mappings are ever documented publicly.
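   
   For illustration only, the mapping encoded by `_to_corrected_pandas_type` above could be summarized as a dict (a hypothetical condensed form for documentation purposes, not an API in this PR):
   
   ```py
   import numpy as np
   
   # Spark SQL type -> corrected pandas/NumPy dtype, mirroring the
   # if/elif chain in _to_corrected_pandas_type above (hypothetical name).
   SPARK_TO_PANDAS_DTYPE = {
       "ByteType": np.int8,
       "ShortType": np.int16,
       "IntegerType": np.int32,
       "LongType": np.int64,
       "FloatType": np.float32,
       "DoubleType": np.float64,
       "BooleanType": bool,
       "TimestampType": np.dtype("datetime64[ns]"),
       "TimestampNTZType": np.dtype("datetime64[ns]"),
       "DayTimeIntervalType": np.dtype("timedelta64[ns]"),
   }
   ```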





[GitHub] [spark] xinrong-meng closed pull request #40988: [SPARK-41971][SQL][PYTHON] Add a config for pandas conversion how to handle struct types

Posted by "xinrong-meng (via GitHub)" <gi...@apache.org>.
xinrong-meng closed pull request #40988: [SPARK-41971][SQL][PYTHON] Add a config for pandas conversion how to handle struct types
URL: https://github.com/apache/spark/pull/40988




[GitHub] [spark] xinrong-meng commented on pull request #40988: [SPARK-41971][SQL][PYTHON] Add a config for pandas conversion how to handle struct types

Posted by "xinrong-meng (via GitHub)" <gi...@apache.org>.
xinrong-meng commented on PR #40988:
URL: https://github.com/apache/spark/pull/40988#issuecomment-1535194866

   Merged to master, thank you!

