
Posted to github@arrow.apache.org by "westonpace (via GitHub)" <gi...@apache.org> on 2023/03/06 21:21:17 UTC

[GitHub] [arrow] westonpace commented on a diff in pull request #34461: GH-34460: [C++][Parquet] Split arrow::FileReader::ReadRowGroups() for flexible async IO

westonpace commented on code in PR #34461:
URL: https://github.com/apache/arrow/pull/34461#discussion_r1127045179


##########
cpp/src/parquet/arrow/reader.h:
##########
@@ -249,6 +249,13 @@ class PARQUET_EXPORT FileReader {
 
   virtual ::arrow::Status ReadRowGroup(int i, std::shared_ptr<::arrow::Table>* out) = 0;
 
+  virtual ::arrow::Status WillNeedRowGroups(const std::vector<int>& row_groups,
+                                            const std::vector<int>& column_indices) = 0;

Review Comment:
   https://github.com/apache/arrow/pull/14723 adds a filesystem method for "read many".  I would like to see that filesystem method support plugging and splitting in the same way that `ReadRangeCache` does today; `ReadRangeCache` would then only be needed when you want true caching.  I think we could then use the filesystem method here instead of `ReadRangeCache`.
   
   This would let local filesystems rely on the OS for plugging and splitting, and let remote filesystems like S3 adapt the algorithm to their needs.  The filesystem method is also async and reliably returns a future, so `WillNeedRowGroups` could in turn return a future (I agree that would be desirable).
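
For readers unfamiliar with the terms, here is a minimal sketch of what "plugging and splitting" a set of read ranges could look like. It is an illustration under assumptions, not Arrow's actual implementation: `ReadRange`, `CoalesceRanges`, `hole_size_limit`, and `range_size_limit` are hypothetical names chosen for this example, and the real `ReadRangeCache` logic may differ (for instance, it may decline to combine ranges rather than split a combined range afterwards).

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical byte range: [offset, offset + length).
struct ReadRange {
  int64_t offset;
  int64_t length;
};

// Hypothetical helper (not an Arrow API).
// "Plugging": merge ranges whose gap is at most hole_size_limit bytes.
// "Splitting": cut each merged range into pieces of at most range_size_limit bytes.
std::vector<ReadRange> CoalesceRanges(std::vector<ReadRange> ranges,
                                      int64_t hole_size_limit,
                                      int64_t range_size_limit) {
  assert(hole_size_limit >= 0);
  assert(range_size_limit > 0);

  // Drop empty ranges, then order the rest by starting offset.
  ranges.erase(std::remove_if(ranges.begin(), ranges.end(),
                              [](const ReadRange& r) { return r.length <= 0; }),
               ranges.end());
  std::sort(ranges.begin(), ranges.end(),
            [](const ReadRange& a, const ReadRange& b) { return a.offset < b.offset; });

  // Pass 1: plug small holes (this also absorbs overlapping ranges).
  std::vector<ReadRange> merged;
  for (const ReadRange& r : ranges) {
    if (!merged.empty()) {
      ReadRange& last = merged.back();
      const int64_t last_end = last.offset + last.length;
      if (r.offset - last_end <= hole_size_limit) {
        const int64_t new_end = std::max(last_end, r.offset + r.length);
        last.length = new_end - last.offset;
        continue;
      }
    }
    merged.push_back(r);
  }

  // Pass 2: split anything larger than range_size_limit.
  std::vector<ReadRange> out;
  for (const ReadRange& r : merged) {
    const int64_t end = r.offset + r.length;
    for (int64_t pos = r.offset; pos < end;) {
      const int64_t len = std::min(range_size_limit, end - pos);
      out.push_back({pos, len});
      pos += len;
    }
  }
  return out;
}
```

Under this sketch, a local filesystem could skip both passes and hand the raw ranges to the OS, while a remote filesystem such as S3 could pick much larger limits to trade extra bytes read for fewer requests.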


