Posted to dev@parquet.apache.org by "ASF GitHub Bot (Jira)" <ji...@apache.org> on 2022/11/17 16:38:00 UTC

[jira] [Commented] (PARQUET-2149) Implement async IO for Parquet file reader

    [ https://issues.apache.org/jira/browse/PARQUET-2149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17635445#comment-17635445 ] 

ASF GitHub Bot commented on PARQUET-2149:
-----------------------------------------

wgtmac commented on PR #968:
URL: https://github.com/apache/parquet-mr/pull/968#issuecomment-1318904541

   It looks like this PR is complete and all review comments have been addressed, except for a few outstanding items:
   
   - Adopt the incoming Hadoop vectored IO API.
   - Benchmark against remote object stores from different cloud providers.
   
   IMO, moving `ioThreadPool` and `processThreadPool` to the reader instance level would make this more flexible. 
   
   @parthchandra Do you have a TODO list for this PR? If you are busy with other work, I can continue it, because it is a really nice improvement.
   
   cc @shangxinli 




> Implement async IO for Parquet file reader
> ------------------------------------------
>
>                 Key: PARQUET-2149
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2149
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-mr
>            Reporter: Parth Chandra
>            Priority: Major
>
> ParquetFileReader's implementation has the following flow (simplified) - 
>       - For every column -> read from storage in 8MB blocks -> read all uncompressed pages into an output queue 
>       - From the output queues -> (downstream) decompression + decoding
> This flow is serialized, which means that downstream threads are blocked until the data has been read. Because a large part of the time is spent waiting for data from storage, threads sit idle and CPU utilization is very low.
> There is no reason why this cannot be made asynchronous _and_ parallel:
> For column _i_ -> read chunks from storage until the end of the column -> intermediate output queue -> read uncompressed pages until the end -> output queue -> (downstream) decompression + decoding
> Note that this can be made completely self-contained in ParquetFileReader, so downstream implementations like Iceberg and Spark will automatically be able to take advantage of it without code changes, as long as the ParquetFileReader APIs are not changed. 
> In past work with async IO ([Drill - async page reader|https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/columnreaders/AsyncPageReader.java]), I have seen a 2x-3x improvement in reading speed for Parquet files.
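The pipeline described above can be sketched as a producer/consumer pair: one thread pool performs storage reads into a queue, while a separate pool decompresses and decodes pages as they arrive. This is only a minimal illustration, not the actual parquet-mr implementation; the class, method, and queue names below are hypothetical, and `Thread.sleep` stands in for storage latency.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of the proposed flow: an IO pool reads raw column-chunk
// buffers from storage into an intermediate queue while a processing pool
// consumes them, so decoding is no longer blocked on storage latency.
public class AsyncReadSketch {
    // Sentinel marking the end of the chunk stream (compared by identity).
    private static final byte[] EOF = new byte[0];

    public static List<String> readColumn(List<byte[]> chunksOnStorage) throws Exception {
        BlockingQueue<byte[]> pageQueue = new LinkedBlockingQueue<>();
        ExecutorService ioPool = Executors.newSingleThreadExecutor();      // stand-in for ioThreadPool
        ExecutorService processPool = Executors.newSingleThreadExecutor(); // stand-in for processThreadPool

        // IO task: fetch raw chunks (simulated storage read) into the queue.
        ioPool.submit(() -> {
            try {
                for (byte[] chunk : chunksOnStorage) {
                    Thread.sleep(1); // stand-in for storage latency
                    pageQueue.put(chunk);
                }
                pageQueue.put(EOF);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        // Processing task: consume pages as they arrive and "decode" them.
        Future<List<String>> decoded = processPool.submit(() -> {
            List<String> out = new ArrayList<>();
            while (true) {
                byte[] page = pageQueue.take();
                if (page == EOF) break;
                out.add(new String(page)); // stand-in for decompression + decoding
            }
            return out;
        });

        List<String> result = decoded.get(10, TimeUnit.SECONDS);
        ioPool.shutdown();
        processPool.shutdown();
        return result;
    }

    public static void main(String[] args) throws Exception {
        List<byte[]> chunks = List.of("page1".getBytes(), "page2".getBytes());
        System.out.println(readColumn(chunks));
    }
}
```

With per-reader pools (as suggested in the comment above), each ParquetFileReader instance could size its IO and processing concurrency independently instead of sharing static pools.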



--
This message was sent by Atlassian Jira
(v8.20.10#820010)