Posted to github@arrow.apache.org by "jorisvandenbossche (via GitHub)" <gi...@apache.org> on 2023/06/30 06:35:38 UTC
[GitHub] [arrow] jorisvandenbossche commented on issue #36392: [Python][Parquet] Reading a parquet file containing timedeltas fails if it was written out using fastparquet

jorisvandenbossche commented on issue #36392:
URL: https://github.com/apache/arrow/issues/36392#issuecomment-1614196317

   This is a consequence of fastparquet writing the timedeltas as a "time" type (https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#time), and so pyarrow also reads this as a time64 type:
   
   ```
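   import tempfile
   
   import pyarrow.parquet as pq
   
   # `df` is the DataFrame with a timedelta column from the issue report
   with tempfile.TemporaryDirectory() as tmpdir:
       path = f"{tmpdir}/test.parquet"
       df.to_parquet(path, engine="fastparquet")
       table = pq.read_table(path)
       pq_meta = pq.read_metadata(path)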
   
   >>> pq_meta.schema
   <pyarrow._parquet.ParquetSchema object at 0x7fbd2cd20d80>
   required group field_id=-1 schema {
     optional int64 field_id=-1 timedelta (Time(isAdjustedToUTC=true, timeUnit=microseconds));
   }
   >>> table.schema
   timedelta: time64[us]
   -- schema metadata --
   pandas: '{"column_indexes": [{"field_name": null, "metadata": null, "name' + 429
   ```
   
   But on conversion to Python/pandas, pyarrow then (as expected) tries to create `datetime.time` objects rather than `datetime.timedelta`, and the underlying integer values are far too large for a time of day (the first value, 86400000000 microseconds, is already a full 24 hours). If we manually cast to int and then to duration (pyarrow's timedelta type), we see that the values are still correct:
   
   ```
   >>> table["timedelta"].cast("int64")
   <pyarrow.lib.ChunkedArray object at 0x7fbd26196a20>
   [
     [
       86400000000,
       86400000000,
       691200000000,
       691200000000,
       172800000000,
       172800000000,
       691200000000,
       691200000000,
       691200000000,
       259200000000
     ]
   ]
   
   >>> table["timedelta"].cast("int64").cast("duration[us]").to_pandas()
   0   1 days
   1   1 days
   2   8 days
   3   8 days
   4   2 days
   5   2 days
   6   8 days
   7   8 days
   8   8 days
   9   3 days
   dtype: timedelta64[ns]
   ```
   
   Fastparquet does store metadata about the original pandas dataframe:
   
   ```
   >>> table.schema.pandas_metadata
   {'column_indexes': [{'field_name': None,
      'metadata': None,
      'name': None,
      'numpy_type': 'object',
      'pandas_type': 'mixed-integer'}],
    'columns': [{'field_name': 'timedelta',
      'metadata': None,
      'name': 'timedelta',
      'numpy_type': 'timedelta64[ns]',
      'pandas_type': 'timedelta64'}],
    'creator': {'library': 'fastparquet', 'version': '0.8.3'},
    'index_columns': [{'kind': 'range',
      'name': None,
      'start': 0,
      'step': 1,
      'stop': 10}],
    'pandas_version': '2.1.0.dev0+976.g870a504af9',
    'partition_columns': []}
   ``` 
   
   This metadata indicates that the original column was timedelta64, so in theory pyarrow _could_ use it to restore the original dtype when converting the table to pandas. However, we typically only use that metadata when it is ambiguous how to convert the data (and, in addition, to restore the column/row indices). In this case pyarrow sees a proper time64 type, which has a clear, unambiguous mapping to Python (i.e. `datetime.time`).
   

