Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/02/05 14:17:05 UTC

[GitHub] [arrow-rs] tustvold edited a comment on pull request #1154: Add `async` arrow parquet reader

tustvold edited a comment on pull request #1154:
URL: https://github.com/apache/arrow-rs/pull/1154#issuecomment-1030632686


   > the notion that the column chunk is the basic i/o unit for Parquet is somewhat outdates with the introduction of the index page.
   
   I agree, insofar as whatever mechanism we eventually add for more granular filter pushdown, be it the page index or something more sophisticated such as described in #1191, I would anticipate using it to refine the data `ParquetRecordBatchStream` fetches prior to decode. That being said, this crate currently doesn't even support decoding the index pages, let alone doing anything with them :sweat_smile: 
   
   > so continuously downloading in the background for data the client
   
   This PR does not add functionality for doing this; it adds hooks that a query engine can use to do so by providing something implementing `AsyncRead` and `AsyncSeek`. This has been a frequent ask within DataFusion, and https://github.com/apache/arrow-datafusion/pull/1617 begins to flesh out how this might look. The parquet crate would not have anything to do with actually fetching data from object storage.
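
   As a rough illustration of the hook idea only: the sketch below is a synchronous analogy built on the standard library's `Read` and `Seek`, not the `AsyncRead`/`AsyncSeek` API in this PR. `read_range` and `example` are hypothetical names made up for this sketch. The point is that the reading code is generic over a caller-supplied source and never knows where the bytes come from.

```rust
use std::io::{self, Cursor, Read, Seek, SeekFrom};

/// Hypothetical helper, not part of the parquet crate: reads `len` bytes
/// starting at `offset` from any caller-supplied source. It has no idea
/// whether the bytes live in memory, a local file, or behind an object store.
fn read_range<R: Read + Seek>(source: &mut R, offset: u64, len: usize) -> io::Result<Vec<u8>> {
    source.seek(SeekFrom::Start(offset))?;
    let mut buf = vec![0u8; len];
    source.read_exact(&mut buf)?;
    Ok(buf)
}

/// Stand-in for what a query engine would supply. Here the "storage" is just
/// an in-memory buffer wrapped in a `Cursor`.
fn example() -> io::Result<Vec<u8>> {
    let mut source = Cursor::new(b"PAR1....column chunk bytes....PAR1".to_vec());
    read_range(&mut source, 0, 4)
}
```

   In the async case the bound would be on `AsyncRead` and `AsyncSeek` instead, and the query engine would be the one deciding how and when the underlying bytes are fetched.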
   
   > complicating all existing client by the added "Send" constraint.
   
   Are these additions causing an issue for you? I have to confess I did not anticipate this causing issues, as almost all types are `Send`. Is there a particular one causing an issue? If so, we could potentially gate it behind the `async` feature flag.
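
   For anyone wanting to check which of their types are affected, one common trick is a compile-time assertion. This is a minimal sketch using only the standard library; `assert_send` and `check_common_types` are hypothetical names for illustration, not anything in this crate.

```rust
use std::sync::Arc;

/// Compiles only if `T` is `Send`; the body is intentionally empty because
/// the check happens entirely at compile time.
fn assert_send<T: Send>() {}

/// Each line below is a compile-time check. If any type were not `Send`,
/// this function would fail to compile rather than fail at runtime.
fn check_common_types() -> bool {
    assert_send::<Vec<u8>>();
    assert_send::<String>();
    assert_send::<Arc<Vec<u8>>>();
    assert_send::<Box<dyn std::io::Read + Send>>();
    // The following would NOT compile, because `Rc` is not `Send`:
    // assert_send::<std::rc::Rc<Vec<u8>>>();
    true
}
```

   Types built on `Rc`, raw pointers, or trait objects without an explicit `+ Send` bound are the usual culprits.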

