Posted to jira@arrow.apache.org by "Weston Pace (Jira)" <ji...@apache.org> on 2021/09/17 12:00:00 UTC
[jira] [Created] (ARROW-14026) [C++] Batch readahead not working correctly in Parquet scanner
Weston Pace created ARROW-14026:
-----------------------------------
Summary: [C++] Batch readahead not working correctly in Parquet scanner
Key: ARROW-14026
URL: https://issues.apache.org/jira/browse/ARROW-14026
Project: Apache Arrow
Issue Type: Bug
Components: C++
Reporter: Weston Pace
The Parquet scanner implements batch readahead by applying a readahead generator to the generator returned by parquet::arrow::FileReader::GetRecordBatchGenerator. However, that generator is constructed with MakeConcatenatedGenerator, which, regrettably, is documented with this comment:
> This generator is async-reentrant but will never pull from source reentrantly and will never pull from any subscription reentrantly.
This effectively prevents any batch readahead from happening: the readahead generator can queue several requests, but the concatenated generator forwards only one at a time, so the file is always read one batch at a time. Part of the problem seems to be that ReadOneRowGroup in reader.cc returns a RecordBatchGenerator when it seems it should be able to return a RecordBatch. For my testing, I changed it to return a single record batch, which allowed me to get rid of the concatenated generator. Batch readahead then appeared to work properly, but I have not fully confirmed the correctness of this change.
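To illustrate the effect described above, here is a minimal toy model in plain C++. It does not use Arrow's API: Source, Serialized, and RunWithReadahead are hypothetical names invented for this sketch. Serialized only imitates the "never pulls from source reentrantly" behaviour quoted above, and the sketch counts how many reads are outstanding at the source when a readahead of a given depth sits on top.

```cpp
#include <algorithm>
#include <deque>
#include <functional>
#include <utility>

// Toy stand-ins, not Arrow types. A "pull" starts a read; the callback
// fires later when the read is completed.
using Callback = std::function<void(int)>;

// Hypothetical source that tracks how many reads are outstanding at once.
struct Source {
  int next = 0;
  int in_flight = 0;
  int max_in_flight = 0;
  std::deque<std::pair<int, Callback>> pending;

  void Pull(Callback cb) {
    ++in_flight;
    max_in_flight = std::max(max_in_flight, in_flight);
    pending.emplace_back(next++, std::move(cb));
  }

  // Completes the oldest outstanding read; returns false if there is none.
  bool CompleteOne() {
    if (pending.empty()) return false;
    auto [value, cb] = std::move(pending.front());
    pending.pop_front();
    --in_flight;
    cb(value);
    return true;
  }
};

// Hypothetical wrapper that accepts overlapping Pull calls but forwards
// at most one of them to the source at a time.
struct Serialized {
  explicit Serialized(Source& s) : source(s) {}

  void Pull(Callback cb) {
    if (busy) {  // a read is already outstanding: queue instead of pulling
      waiting.push_back(std::move(cb));
      return;
    }
    busy = true;
    source.Pull([this, cb = std::move(cb)](int value) {
      busy = false;
      if (!waiting.empty()) {  // forward the next queued request, in order
        Callback queued = std::move(waiting.front());
        waiting.pop_front();
        Pull(std::move(queued));
      }
      cb(value);
    });
  }

  Source& source;
  bool busy = false;
  std::deque<Callback> waiting;
};

// Keeps up to `depth` pulls outstanding on `gen` until `total` items have
// been requested, then reports the peak number of reads seen at the source.
template <typename Gen>
int RunWithReadahead(Gen& gen, Source& source, int depth, int total) {
  int requested = 0;
  std::function<void(int)> on_item = [&](int) {
    if (requested < total) {
      ++requested;
      gen.Pull(on_item);
    }
  };
  while (requested < std::min(depth, total)) {
    ++requested;
    gen.Pull(on_item);
  }
  while (source.CompleteOne()) {
  }
  return source.max_in_flight;
}
```

In this model, running a readahead of depth 4 directly on a Source keeps four reads outstanding, while running the same readahead through Serialized never has more than one read outstanding at the source. That matches the symptom reported here, under the assumption that the concatenated generator behaves like Serialized with respect to its source.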