Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/05/30 11:37:37 UTC

[GitHub] [arrow] MMCMA opened a new issue, #13262: Sort order issue pyarrow version 8.0.0. vs 6.0.1

MMCMA opened a new issue, #13262:
URL: https://github.com/apache/arrow/issues/13262

   I noticed an issue with row order when writing to a dataset store. The code below works in version 6.0.1 but fails in 8.0.0, apparently because the rows (and therefore the index) are stored in a different order. Calling DataFrame.sort_index() on the result solves the problem, but I would like to avoid this computationally expensive operation.
   
           import pandas as pd
           import numpy as np
           import pyarrow as pa
           import pyarrow.dataset as ds
           import pyarrow.parquet as pq
           from pathlib import Path
   
           rows = 500
           columns = 300
           project_path =  '/TBD'
           path = Path(project_path) / 'sort_issue'
           path.mkdir()
   
           data = np.random.normal(size=(rows, columns))
           index = pd.date_range('19900101', periods=rows, freq='b')
           data = pd.DataFrame(data=data, index=index).stack().to_frame('a')
           year = [x.year for x in data.index.get_level_values(0)]
           data['year'] = year
   
   
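           # Write the frame as a dataset partitioned by year, then read it back.
           tbl = pa.Table.from_pandas(data)  # NOQA
           pq.write_to_dataset(tbl, root_path=path, partition_cols=['year'])
           tmp_ds = ds.dataset(path, format="parquet")
   
           data_disk = tmp_ds.to_table().to_pandas()
   
           # This label-based slice depends on the index being in sorted order.
           data_disk.loc['19910101':'19911231']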
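
A minimal pandas-only sketch of the workaround mentioned above, assuming the cost of sorting should be paid only when the index really comes back unordered. The helper name `ensure_sorted` and the small synthetic frame are illustrative and not part of the original report; a shuffle stands in for the reordering seen after the round trip through the dataset.

```python
import numpy as np
import pandas as pd


def ensure_sorted(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the frame with a sorted index, sorting only when needed.

    Hypothetical helper: the monotonicity check is a single pass over the
    index, so an already ordered frame skips the sort entirely.
    """
    if frame.index.is_monotonic_increasing:
        return frame
    return frame.sort_index()


# Small stand-in for the stacked frame in the report: (date, column) index.
dates = pd.date_range("1990-01-01", periods=500, freq="B")
index = pd.MultiIndex.from_product([dates, range(3)], names=["date", "col"])
frame = pd.DataFrame({"a": np.arange(len(index), dtype=float)}, index=index)

# Simulate rows coming back in a different order.
shuffled = frame.sample(frac=1.0, random_state=0)

restored = ensure_sorted(shuffled)

# Label-based slicing on the first index level works once the index is sorted.
subset = restored.loc["1991-01-01":"1991-12-31"]
```

This does not avoid the sort when the order is actually lost; it only makes the sort conditional.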


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace commented on issue #13262: Sort order lost in pq.write_to_dataset() version 8.0.0. vs 6.0.1

Posted by GitBox <gi...@apache.org>.
westonpace commented on issue #13262:
URL: https://github.com/apache/arrow/issues/13262#issuecomment-1142670801

   Is this a duplicate of https://issues.apache.org/jira/browse/ARROW-10883 ?




[GitHub] [arrow] MMCMA commented on issue #13262: Sort order lost in pq.write_to_dataset() version 8.0.0. vs 6.0.1

Posted by GitBox <gi...@apache.org>.
MMCMA commented on issue #13262:
URL: https://github.com/apache/arrow/issues/13262#issuecomment-1143554198

   Yes, it seems so. It relates most closely to https://issues.apache.org/jira/browse/ARROW-16506. In our use case the row order is very important, and explicit in-memory sorting is currently expensive given the size of our data. Downgrading to 7.0.0 might fix the problem for now.

