Posted to jira@arrow.apache.org by "Weston Pace (Jira)" <ji...@apache.org> on 2021/09/17 12:00:00 UTC

[jira] [Created] (ARROW-14026) [C++] Batch readahead not working correctly in Parquet scanner

Weston Pace created ARROW-14026:
-----------------------------------

             Summary: [C++] Batch readahead not working correctly in Parquet scanner
                 Key: ARROW-14026
                 URL: https://issues.apache.org/jira/browse/ARROW-14026
             Project: Apache Arrow
          Issue Type: Bug
          Components: C++
            Reporter: Weston Pace


The Parquet scanner implements batch readahead by applying a readahead generator to the generator returned by parquet::arrow::FileReader::GetRecordBatchGenerator. However, that generator is constructed with MakeConcatenatedGenerator which, regrettably, has this comment:

> This generator is async-reentrant but will never pull from source reentrantly and will never pull from any subscription reentrantly.

This effectively prevents any batch readahead from happening, so the file is always read one batch at a time. Part of the problem seems to be that ReadOneRowGroup in reader.cc returns a RecordBatchGenerator when it seems it should be able to return a single RecordBatch. For the testing I am doing, I changed it to return a single record batch, which allowed me to get rid of the concatenated generator. Batch readahead then appeared to work properly, but I didn't fully confirm the correctness of this change.
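
The effect can be illustrated with a toy model that does not use Arrow at all. The names FakeSource, NonReentrantPuller and MaxSourceConcurrency below are hypothetical stand-ins invented for this sketch, not Arrow APIs. The sketch only assumes what the quoted comment says: the wrapper accepts reentrant calls but forwards them to its source one at a time. Under that assumption, a consumer that tries to keep several pulls outstanding still never sees more than one read in flight at the source.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <utility>

// Hypothetical async source (not an Arrow API): each Pull() starts one "read"
// that stays in flight until CompleteOne() finishes it.
class FakeSource {
 public:
  void Pull(std::function<void()> on_done) {
    pending_.push_back(std::move(on_done));
    max_in_flight_ = std::max(max_in_flight_, pending_.size());
  }
  // Finish the oldest in-flight read; returns false if nothing is in flight.
  bool CompleteOne() {
    if (pending_.empty()) return false;
    auto cb = std::move(pending_.front());
    pending_.pop_front();
    cb();  // may start new reads; the deque entry is already removed
    return true;
  }
  std::size_t max_in_flight() const { return max_in_flight_; }

 private:
  std::deque<std::function<void()>> pending_;
  std::size_t max_in_flight_ = 0;
};

// Models a generator that may be called reentrantly but never pulls from its
// source reentrantly: extra calls wait until the current pull has finished.
class NonReentrantPuller {
 public:
  explicit NonReentrantPuller(FakeSource* source) : source_(source) {}
  void Pull(std::function<void()> on_done) {
    waiting_.push_back(std::move(on_done));
    if (!busy_) StartNext();
  }

 private:
  void StartNext() {
    if (waiting_.empty()) {
      busy_ = false;
      return;
    }
    busy_ = true;
    auto cb = std::move(waiting_.front());
    waiting_.pop_front();
    // Only after this read completes is the next queued call forwarded.
    source_->Pull([this, cb = std::move(cb)] {
      cb();
      StartNext();
    });
  }

  FakeSource* source_;
  std::deque<std::function<void()>> waiting_;
  bool busy_ = false;
};

// Issue `total` pulls while trying to keep `readahead` of them outstanding,
// then report how many reads the source ever had in flight at once.
template <typename Puller>
std::size_t MaxSourceConcurrency(FakeSource& source, Puller& puller,
                                 std::size_t total, std::size_t readahead) {
  std::size_t issued = 0;
  std::function<void()> refill = [&] {
    if (issued < total) {
      ++issued;
      puller.Pull(refill);  // each completion requests one more item
    }
  };
  for (std::size_t i = 0; i < readahead; ++i) refill();
  while (source.CompleteOne()) {
  }
  return source.max_in_flight();
}
```

Calling MaxSourceConcurrency with a FakeSource passed as both source and puller, 10 items and a readahead of 4 reports 4 concurrent reads. Routing the same requests through a NonReentrantPuller wrapped around the source reports 1, which matches the one-batch-at-a-time behavior described above.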



--
This message was sent by Atlassian Jira
(v8.3.4#803005)