Posted to issues@arrow.apache.org by "adampinky85 (via GitHub)" <gi...@apache.org> on 2024/03/27 06:27:13 UTC
[I] PyArrow Table to Pandas int8 conversion issue [arrow]

adampinky85 opened a new issue, #40815:
URL: https://github.com/apache/arrow/issues/40815

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   Hi team, 
   
   We use Arrow / Parquet files extensively for data analysis with Pandas, and it's excellent! We've found an issue that occurs when converting a PyArrow Table to a Pandas DataFrame.
   
   Due to the large size of our dataset, we write the Parquet files using minimal field data types, e.g. categories, and specifically we're using an `int8` for `field_6`. This field is unfortunately nullable, which is likely the cause of the issue.
   
   The issue is that the `int8` field arrives in Pandas as a `float64`. With `pd.read_parquet` this can be fixed by passing the `dtype_backend="numpy_nullable"` argument, which converts the field to an `Int8`.
   
   Is there an equivalent mechanism when using `pyarrow.parquet.read_pandas`, so that the field is retrieved as an `Int8` or similar? I assume the conversion to `float64` is forced because the field is nullable.
   
   Many thanks!
   
   PyArrow Parquet Metadata:
   ``` 
   <pyarrow._parquet.FileMetaData object at 0x7f9ca8d65030>
     created_by: parquet-cpp-arrow version 15.0.2
     num_columns: 11
     num_rows: 35734802
     num_row_groups: 35
     format_version: 2.6
     serialized_size: 41957
   ```
   
   
   PyArrow Parquet Schema:
   ``` 
   <pyarrow._parquet.ParquetSchema object at 0x7f9ca8d55f80>
   required group field_id=-1 schema {
     optional int64 field_id=-1 timestamp (Timestamp(isAdjustedToUTC=false, timeUnit=milliseconds, is_from_converted_type=false, force_set_converted_type=false));
     optional binary field_id=-1 field_1 (String);
     optional binary field_id=-1 field_2 (String);
     optional binary field_id=-1 field_3 (String);
     optional binary field_id=-1 field_4 (String);
     optional binary field_id=-1 field_5 (String);
     optional int32 field_id=-1 field_6 (Int(bitWidth=8, isSigned=true));       <------ note Int8
     optional double field_id=-1 field_7;
     optional double field_id=-1 field_8;
     optional double field_id=-1 field_9;
     optional double field_id=-1 field_10;
   }
   ```
   
   1. Pandas read parquet:
   ```
   df = pd.read_parquet(filename)
   timestamp datetime64[ms]
   field_1        category
   field_2        category
   field_3        category
   field_4        category
   field_5        category
   field_6        float64          <---- note float64
   field_7        float64
   field_8        float64
   field_9        float64
   field_10       float64
   ```
   
   2. Pandas read parquet + numpy_nullable:
   ```
   df = pd.read_parquet(filename, dtype_backend="numpy_nullable")
   timestamp datetime64[ms]
   field_1        category
   field_2        category
   field_3        category
   field_4        category
   field_5        category
   field_6        Int8           <---- note Int8
   field_7        Float64
   field_8        Float64
   field_9        Float64
   field_10       Float64
   ```
   
   3. PyArrow parquet read table:
   ```
   table = pyarrow.parquet.read_pandas(target_path, filesystem=s3_client)
   df = table.to_pandas()
   timestamp datetime64[ms]
   field_1        category
   field_2        category
   field_3        category
   field_4        category
   field_5        category
   field_6        float64          <---- note float64
   field_7        float64
   field_8        float64
   field_9        float64
   field_10       float64
   ```
   
   
   ### Component(s)
   
   Python

