Posted to issues@spark.apache.org by "Hyukjin Kwon (Jira)" <ji...@apache.org> on 2019/10/08 05:42:26 UTC

[jira] [Resolved] (SPARK-24357) createDataFrame in Python infers large integers as long type and then fails silently when converting them

     [ https://issues.apache.org/jira/browse/SPARK-24357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-24357.
----------------------------------
    Resolution: Incomplete

> createDataFrame in Python infers large integers as long type and then fails silently when converting them
> ---------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-24357
>                 URL: https://issues.apache.org/jira/browse/SPARK-24357
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Joel Croteau
>            Priority: Major
>              Labels: bulk-closed
>
> When inferring the schema of an RDD passed to createDataFrame, PySpark SQL infers any integral value as LongType, a 64-bit integer, without checking whether the values actually fit in 64 bits. If a value is larger than 64 bits, then when it is pickled and unpickled in Java, Unpickler converts it to a BigInteger. When applySchemaToPythonRDD is called, it ignores the BigInteger type and returns null. As a result, any large integers in the resulting DataFrame are silently converted to None. This can create surprising and difficult-to-debug behavior, particularly if you are not aware of the limitation. There should either be a runtime error at some point in this conversion chain, or else _infer_type should infer larger integers as DecimalType with appropriate precision, or as BinaryType. The former would be less convenient, but the latter may be problematic to implement in practice. In any case, we should stop silently converting large integers to None.
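The two remedies proposed in the description (raise an error, or widen to a decimal type) can be sketched in plain Python. This is only an illustration, not PySpark's actual _infer_type: the helper name infer_integer_type is hypothetical, it returns type names as strings rather than Spark type objects, and it assumes the 38-digit precision limit of Spark's DecimalType.

```python
from typing import Iterable

# Bounds of a signed 64-bit integer, i.e. what LongType can hold.
INT64_MIN = -(2 ** 63)
INT64_MAX = 2 ** 63 - 1

# Assumption: Spark's DecimalType allows at most 38 digits of precision.
MAX_DECIMAL_PRECISION = 38


def infer_integer_type(values: Iterable[int]) -> str:
    """Hypothetical helper: pick a type name wide enough for all values.

    Returns "LongType" when every value fits in 64 bits, a
    "DecimalType(p,0)" name when a wider value still fits in 38 digits,
    and raises OverflowError otherwise instead of silently losing data.
    """
    fits_long = True
    max_digits = 1
    for value in values:
        if not INT64_MIN <= value <= INT64_MAX:
            fits_long = False
        # Count decimal digits of the magnitude; the sign is not a digit.
        max_digits = max(max_digits, len(str(abs(value))))

    if fits_long:
        return "LongType"
    if max_digits <= MAX_DECIMAL_PRECISION:
        return f"DecimalType({max_digits},0)"
    raise OverflowError(
        f"integer with {max_digits} digits does not fit LongType or DecimalType"
    )
```

Running a check like this over the Python values before calling createDataFrame would surface out-of-range integers as an explicit error or a wider type, rather than as None values in the resulting DataFrame.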



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
