Posted to reviews@spark.apache.org by edlee123 <gi...@git.apache.org> on 2018/04/14 22:55:04 UTC
[GitHub] spark pull request #18378: [SPARK-21163][SQL] DataFrame.toPandas should resp...
Github user edlee123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/18378#discussion_r181565606
--- Diff: python/pyspark/sql/dataframe.py ---
@@ -1750,6 +1761,24 @@ def _to_scala_map(sc, jm):
return sc._jvm.PythonUtils.toScalaMap(jm)
+def _to_corrected_pandas_type(dt):
+ """
+ When converting Spark SQL records to a Pandas DataFrame, the inferred data type may be wrong.
+ This method gets the corrected data type for Pandas if that type may be inferred incorrectly.
+ """
+ import numpy as np
+ if type(dt) == ByteType:
+ return np.int8
+ elif type(dt) == ShortType:
+ return np.int16
+ elif type(dt) == IntegerType:
+ return np.int32
+ elif type(dt) == FloatType:
+ return np.float32
+ else:
--- End diff --
Had a question: in Spark 2.2.1, if I do `.toPandas()` on a Spark DataFrame with an integer column, the dtype in pandas is int64, whereas in Spark 2.3.0 the ints are converted to int32. I ran the code below in both Spark 2.2.1 and 2.3.0:
```
import pyspark.sql.functions as sf

df = spark.sparkContext.parallelize([(i, ) for i in [1, 2, 3]]).toDF(["a"]).select(sf.col('a').cast('int')).toPandas()
df.dtypes
```
Is this intended? We ran into this because we have unit tests in a project that passed in Spark 2.2.1 but fail in Spark 2.3.0.
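For anyone reproducing this without a Spark cluster: the behavior change can be sketched with plain pandas/NumPy. This is a hypothetical standalone illustration of what `_to_corrected_pandas_type` does, not the actual Spark code path: pandas infers Python ints as int64 by default, while Spark's `IntegerType` is 32-bit, so 2.3.0 downcasts the column to int32.

```python
import numpy as np
import pandas as pd

# pandas' default inference for Python ints (on 64-bit platforms) is int64,
# which is what Spark 2.2.1's toPandas() returned.
pdf = pd.DataFrame({"a": [1, 2, 3]})
print(pdf["a"].dtype)  # int64

# Spark 2.3.0 applies the corrected dtype for IntegerType, i.e. np.int32,
# roughly equivalent to:
corrected = pdf["a"].astype(np.int32)
print(corrected.dtype)  # int32
```

Tests that compare dtypes directly (e.g. `assert df.dtypes["a"] == np.int64`) will therefore break across the version boundary, even though the values are identical.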
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org