Posted to github@arrow.apache.org by "jorisvandenbossche (via GitHub)" <gi...@apache.org> on 2023/05/24 10:04:48 UTC

[GitHub] [arrow] jorisvandenbossche commented on issue #33321: [Python] Support converting to non-nano datetime64 for pandas >= 2.0

jorisvandenbossche commented on issue #33321:
URL: https://github.com/apache/arrow/issues/33321#issuecomment-1560826211

   The "pandas metadata" is custom metadata that we store in the pyarrow schema whenever the data is created from a pandas.DataFrame:
   
   ```python
   >>> import pandas as pd
   >>> import pyarrow as pa
   >>> df = pd.DataFrame({"col": pd.date_range("2012-01-01", periods=3, freq="D")})
   >>> df
            col
   0 2012-01-01
   1 2012-01-02
   2 2012-01-03
   >>> table = pa.table(df)
   >>> table.schema
   col: timestamp[ns]
   -- schema metadata --
   pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 413
   # An easier way to access this (converted to a dict):
   >>> table.schema.pandas_metadata
   {'index_columns': [{'kind': 'range',
      'name': None,
      'start': 0,
      'stop': 3,
      'step': 1}],
    'column_indexes': [{'name': None,
      'field_name': None,
      'pandas_type': 'unicode',
      'numpy_type': 'object',
      'metadata': {'encoding': 'UTF-8'}}],
    'columns': [{'name': 'col',
      'field_name': 'col',
      'pandas_type': 'datetime',
      'numpy_type': 'datetime64[ns]',
      'metadata': None}],
    'creator': {'library': 'pyarrow', 'version': '13.0.0.dev106+gfbe5f641d'},
    'pandas_version': '2.1.0.dev0+484.g7187e67500'}
   ```
   
   So this indicates that the original data in the pandas.DataFrame had the "datetime64[ns]" dtype. In this case that matches the Arrow type, but after a roundtrip through Parquet, for example, this might no longer be the case:
   
   ```python
   >>> import pyarrow.parquet as pq
   >>> pq.write_table(table, "test.parquet")
   >>> table2 = pq.read_table("test.parquet")
   >>> table2.schema
   col: timestamp[us]                 # <--- now us instead of ns
   -- schema metadata --
   pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 413
   >>> table2.schema.pandas_metadata
   { ...
    'columns': [{'name': 'col',
      'field_name': 'col',
      'pandas_type': 'datetime',
      'numpy_type': 'datetime64[ns]',   # <--- but this still indicates ns
      'metadata': None}],
   ...
   ```
   
   So the question here is: what should `table2.to_pandas()` do? Use the microsecond resolution of the data, or the nanosecond resolution of the metadata?
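   
   As a concrete workaround, a user could follow the metadata explicitly by casting back to nanoseconds before converting. This is a minimal sketch reusing `table2` from the example above; `Table.cast` and `to_pandas` are existing pyarrow APIs:
   
   ```python
   import pyarrow as pa
   
   # Cast the timestamp column back to the nanosecond resolution recorded
   # in the pandas metadata, then convert; the resulting pandas column is
   # datetime64[ns], matching the metadata.
   schema_ns = pa.schema([pa.field("col", pa.timestamp("ns"))])
   df = table2.cast(schema_ns).to_pandas()
   ```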
   
   (Note that this is also a consequence of the default Parquet format version we write not yet supporting nanoseconds. We should probably bump that default version, and then the nanoseconds would be preserved in the Parquet roundtrip; see the sketch below.)
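   
   For reference, a minimal sketch of that, using the existing `version` option of `pq.write_table` (the filename is illustrative): Parquet format version 2.6 supports nanosecond timestamps, so the resolution survives the roundtrip.
   
   ```python
   import pyarrow.parquet as pq
   
   # Writing with Parquet format version 2.6 keeps nanosecond timestamps
   # intact, so no ns -> us truncation happens on the roundtrip.
   pq.write_table(table, "test_ns.parquet", version="2.6")
   pq.read_table("test_ns.parquet").schema  # col: timestamp[ns]
   ```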
   
   Now, I am not sure it would be easy to use the information in the pandas metadata to influence the conversion, as we typically only use that metadata *after* converting the actual data, to finalize the resulting pandas DataFrame (e.g. set the index, cast the column names, ...).
   And I am also not fully sure it would actually be desirable to follow the pandas metadata, since that would involve an extra conversion step. Moreover, effectively all existing pandas metadata (e.g. in already written Parquet files) will say the data is nanoseconds, since until recently that was the only resolution supported by pandas.
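   
   For illustration only, here is roughly what "following the pandas metadata" could look like as a user-side helper. This is a hypothetical function, not an existing API; it only relies on the existing `pandas_metadata`, `Field.with_type`, and `Table.cast` APIs:
   
   ```python
   import pyarrow as pa
   
   def to_pandas_following_metadata(table):
       """Hypothetical helper: cast datetime columns back to the resolution
       recorded in the pandas metadata before converting to pandas."""
       meta = table.schema.pandas_metadata or {}
       by_name = {c.get("field_name"): c for c in meta.get("columns", [])}
       fields = []
       for field in table.schema:
           col = by_name.get(field.name)
           if col and col.get("numpy_type", "").startswith("datetime64["):
               # numpy_type looks like "datetime64[ns]"; extract the unit
               unit = col["numpy_type"].split("[", 1)[1].rstrip("]")
               tz = getattr(field.type, "tz", None)
               field = field.with_type(pa.timestamp(unit, tz=tz))
           fields.append(field)
       return table.cast(pa.schema(fields)).to_pandas()
   ```
   
   With the example above, `to_pandas_following_metadata(table2)` would return a `datetime64[ns]` column again, at the cost of exactly the extra conversion step mentioned.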

