Posted to issues@arrow.apache.org by "Angus Hollands (Jira)" <ji...@apache.org> on 2021/06/23 19:27:00 UTC

[jira] [Created] (ARROW-13153) `parquet_dataset` loses ordering of files in `_metadata`

Angus Hollands created ARROW-13153:
--------------------------------------

             Summary: `parquet_dataset` loses ordering of files in `_metadata`
                 Key: ARROW-13153
                 URL: https://issues.apache.org/jira/browse/ARROW-13153
             Project: Apache Arrow
          Issue Type: Bug
          Components: Parquet
            Reporter: Angus Hollands


Hi all, thanks for the useful library!

I noticed when calling {{pyarrow.dataset.parquet_dataset}} that the order of the files ({{dataset.files}}) does not match the order stored in {{_metadata}}, as given by {{metadata.row_group(i).column(0).file_path}}. I'm not an Arrow expert by any means, but is this intentional?
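
A minimal sketch of the comparison described above, in case it helps with reproducing. The helper names ({{files_in_metadata_order}}, {{compare_orderings}}) and the {{metadata_path}} argument are hypothetical and only for illustration; the pyarrow calls are {{pyarrow.parquet.read_metadata}} and {{pyarrow.dataset.parquet_dataset}}. It assumes the first column chunk of each row group carries the file path.

```python
def files_in_metadata_order(metadata):
    """Return the file paths recorded in a Parquet FileMetaData object,
    in order of first appearance across its row groups."""
    ordered = {}  # dicts preserve insertion order, so this de-duplicates in order
    for i in range(metadata.num_row_groups):
        path = metadata.row_group(i).column(0).file_path
        ordered.setdefault(path, None)
    return list(ordered)


def compare_orderings(metadata_path):
    """Hypothetical helper: return (order in _metadata, order in dataset.files)."""
    # Imported lazily so the helper above stays usable without pyarrow installed.
    import pyarrow.dataset as ds
    import pyarrow.parquet as pq

    metadata = pq.read_metadata(metadata_path)
    dataset = ds.parquet_dataset(metadata_path)
    # Note: entries in dataset.files may be prefixed with the directory that
    # contains _metadata, whereas file_path values are relative, so compare
    # suffixes or basenames rather than the raw strings.
    return files_in_metadata_order(metadata), list(dataset.files)
```

Calling {{compare_orderings("path/to/_metadata")}} (the path is a placeholder) and printing both lists side by side should make any reordering visible.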

I think the unordered map at the line linked here is the culprit, since an unordered map does not preserve insertion order, but I have not recompiled to test this theory: [https://github.com/apache/arrow/blob/133b1a904bf7fc1d24343c306a2279e27d4ebe6d/cpp/src/arrow/dataset/file_parquet.cc#L870]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)