Posted to issues@arrow.apache.org by "LyricalBernardin (via GitHub)" <gi...@apache.org> on 2023/09/13 17:02:02 UTC

[GitHub] [arrow] LyricalBernardin opened a new issue, #37699: Individual RangeIndex not preserved with partitioning

LyricalBernardin opened a new issue, #37699:
URL: https://github.com/apache/arrow/issues/37699

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   I am testing with pyarrow 13.0.0 and 9.0.0. I am saving multiple dataframes separately with the following function:
   
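   import pyarrow.parquet as pq

   # root_path, use_threads, compression, and kwargs are assumed to be defined elsewhere.
   def save(table):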
       pq.write_to_dataset(
           table=table,
           root_path=root_path,
           partition_cols=['date'],
           basename_template='part-{i}.parquet',
           use_threads=use_threads,
           use_legacy_dataset=False,
           existing_data_behavior='overwrite_or_ignore',
           use_deprecated_int96_timestamps=False,
           compression=compression,
           **kwargs,
       )
   
   Each dataframe `df` has unique `date` column values and its own RangeIndex, range(len(df)). With 'overwrite_or_ignore', I am saving all dataframes to the same root_path.
   When I load any individual partition, everything looks good. However, when I read the root_path folder as a ParquetDataset, the returned dataframe `df_all` does not preserve the indices of its partitions; its index becomes RangeIndex(len(df_all)).
   
   ### Component(s)
   
   Python


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org