You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by "xinrong-meng (via GitHub)" <gi...@apache.org> on 2023/05/01 21:17:21 UTC

[GitHub] [spark] xinrong-meng commented on a diff in pull request #40988: [SPARK-41971][SQL][PYTHON] Add a config for pandas conversion how to handle struct types

xinrong-meng commented on code in PR #40988:
URL: https://github.com/apache/spark/pull/40988#discussion_r1181899926


##########
python/pyspark/sql/pandas/types.py:
##########
@@ -462,3 +467,233 @@ def _convert_dict_to_map_items(s: "PandasSeriesLike") -> "PandasSeriesLike":
     :return: pandas.Series of lists of (key, value) pairs
     """
     return cast("PandasSeriesLike", s.apply(lambda d: list(d.items()) if d is not None else None))
+
+
+def _to_corrected_pandas_type(dt: DataType) -> Optional[Any]:
+    """
+    When converting Spark SQL records to Pandas `pandas.DataFrame`, the inferred data type
+    may be wrong. This method gets the corrected data type for Pandas if that type may be
+    inferred incorrectly.
+    """
+    import numpy as np
+
+    if type(dt) == ByteType:
+        return np.int8
+    elif type(dt) == ShortType:
+        return np.int16
+    elif type(dt) == IntegerType:
+        return np.int32
+    elif type(dt) == LongType:
+        return np.int64
+    elif type(dt) == FloatType:
+        return np.float32
+    elif type(dt) == DoubleType:
+        return np.float64
+    elif type(dt) == BooleanType:
+        return bool
+    elif type(dt) == TimestampType:
+        return np.dtype("datetime64[ns]")
+    elif type(dt) == TimestampNTZType:
+        return np.dtype("datetime64[ns]")
+    elif type(dt) == DayTimeIntervalType:
+        return np.dtype("timedelta64[ns]")
+    else:
+        return None
+
+
+def _create_converter_to_pandas(

Review Comment:
   I am wondering if PySpark/Pandas data type mappings are ever documented publicly.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org