Posted to jira@arrow.apache.org by "Jacqueline Nolis (Jira)" <ji...@apache.org> on 2020/09/29 13:42:00 UTC
[jira] [Created] (ARROW-10133) parquet Int64 col cast to float64 on load in pandas
Jacqueline Nolis created ARROW-10133:
----------------------------------------
Summary: parquet Int64 col cast to float64 on load in pandas
Key: ARROW-10133
URL: https://issues.apache.org/jira/browse/ARROW-10133
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 0.17.1
Reporter: Jacqueline Nolis
Attachments: example-failed-int64.parquet
Under certain conditions, a saved parquet table with a column that is Int64 and all NA seems to be cast to float64 with all NaN on load. The desired behavior is for it to stay Int64. Attached is a table where the issue occurs: the second column should be an int64 but is loaded as a float64 in pandas.
Interestingly, the column seems to be interpreted correctly as an Int64 when loading in R, so perhaps it's only a pandas issue.
import pyarrow.parquet as pq
import boto3
import pandas as pd
import io
obj = boto3.client('s3').get_object(Bucket="...", Key='...') # file attached to ticket
x = pq.read_table(io.BytesIO(obj['Body'].read()))
y = x.to_pandas() # this is where the undesired int64 to a float64 cast occurs
# >>> x
# pyarrow.Table
# product_id: string
# cost: int64
# name: string
# >>> y.dtypes
# product_id object
# cost float64
# name object
# dtype: object