Posted to issues@arrow.apache.org by "adampinky85 (via GitHub)" <gi...@apache.org> on 2024/03/27 06:27:13 UTC
[I] PyArrow Table to Pandas int8 conversion issue [arrow]
adampinky85 opened a new issue, #40815:
URL: https://github.com/apache/arrow/issues/40815
### Describe the bug, including details regarding any error messages, version, and platform.
Hi team,
We use Arrow / Parquet files extensively for data analysis with pandas, and it's excellent! We've found an issue that occurs when converting PyArrow Tables to pandas DataFrames.
Due to the large size of our dataset, we write the Parquet files using the smallest practical field data types, e.g. categories, and specifically an `int8` for `field_6`. Unfortunately this field is nullable, which is likely the cause of the issue.
The issue is that the `int8` field is converted to a `float64` in pandas. When reading with `pd.read_parquet`, this can be fixed by passing the `dtype_backend="numpy_nullable"` argument, which converts the field to an `Int8` instead.
Is there an equivalent mechanism when using `pyarrow.parquet.read_pandas` to retrieve the field as an `Int8` or similar? I assume the conversion to `float64` is forced because the field is nullable.
Many thanks!
PyArrow Parquet Metadata:
```
<pyarrow._parquet.FileMetaData object at 0x7f9ca8d65030>
created_by: parquet-cpp-arrow version 15.0.2
num_columns: 11
num_rows: 35734802
num_row_groups: 35
format_version: 2.6
serialized_size: 41957
```
PyArrow Parquet Schema:
```
<pyarrow._parquet.ParquetSchema object at 0x7f9ca8d55f80>
required group field_id=-1 schema {
optional int64 field_id=-1 timestamp (Timestamp(isAdjustedToUTC=false, timeUnit=milliseconds, is_from_converted_type=false, force_set_converted_type=false));
optional binary field_id=-1 field_1 (String);
optional binary field_id=-1 field_2 (String);
optional binary field_id=-1 field_3 (String);
optional binary field_id=-1 field_4 (String);
optional binary field_id=-1 field_5 (String);
optional int32 field_id=-1 field_6 (Int(bitWidth=8, isSigned=true)); <------ note Int8
optional double field_id=-1 field_7;
optional double field_id=-1 field_8;
optional double field_id=-1 field_9;
optional double field_id=-1 field_10;
}
```
1. Pandas read parquet:
```
df = pd.read_parquet(filename)
timestamp datetime64[ms]
field_1 category
field_2 category
field_3 category
field_4 category
field_5 category
field_6 float64 <---- note float64
field_7 float64
field_8 float64
field_9 float64
field_10 float64
```
2. Pandas read parquet + numpy_nullable:
```
df = pd.read_parquet(filename, dtype_backend="numpy_nullable")
timestamp datetime64[ms]
field_1 category
field_2 category
field_3 category
field_4 category
field_5 category
field_6 Int8 <---- note Int8
field_7 Float64
field_8 Float64
field_9 Float64
field_10 Float64
```
3. PyArrow parquet read table:
```
table = pyarrow.parquet.read_pandas(target_path, filesystem=s3_client)
df = table.to_pandas()
timestamp datetime64[ms]
field_1 category
field_2 category
field_3 category
field_4 category
field_5 category
field_6 float64 <---- note float64
field_7 float64
field_8 float64
field_9 float64
field_10 float64
```
### Component(s)
Python