You are viewing a plain text version of this content. The canonical link for it is here.

Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/07/14 17:45:20 UTC

[GitHub] [arrow] bkietz commented on a change in pull request #10693: ARROW-13224: [Python][Doc] Documentation missing for pyarrow.dataset.write_dataset

bkietz commented on a change in pull request #10693:
URL: https://github.com/apache/arrow/pull/10693#discussion_r669825752



##########
File path: docs/source/python/dataset.rst
##########
@@ -456,20 +456,163 @@ is materialized as columns when reading the data and can be used for filtering:
     dataset.to_table().to_pandas()
     dataset.to_table(filter=ds.field('year') == 2019).to_pandas()
 
+Another benefit of manually scheduling the files is that the order of the files
+controls the order of the data.  When performing an ordered read (or a read to
+a table) then the rows returned will match the order of the files given.  This
+only applies when the dataset is constructed with a list of files.  There
+are no order guarantees given when the files are instead discovered by scanning

Review comment:
       We don't guarantee order for selectors because ARROW-8163 (asynchronous fragment discovery) might not guarantee order. Lexicographic sorting *could* be maintained for synchronous discovery from a selector, but in general we'd want to push a fragment into scan as soon as it's yielded by `FileSystem::GetFileInfoGenerator`




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org