Posted to issues@arrow.apache.org by "Joris Van den Bossche (Jira)" <ji...@apache.org> on 2021/01/06 10:46:00 UTC

[jira] [Created] (ARROW-11142) [C++][Parquet] Inconsistent batch_size usage in parquet GetRecordBatchReader

Joris Van den Bossche created ARROW-11142:
---------------------------------------------

             Summary: [C++][Parquet] Inconsistent batch_size usage in parquet GetRecordBatchReader
                 Key: ARROW-11142
                 URL: https://issues.apache.org/jira/browse/ARROW-11142
             Project: Apache Arrow
          Issue Type: Improvement
          Components: C++
            Reporter: Joris Van den Bossche


The RecordBatchReader returned from {{parquet::arrow::FileReader::GetRecordBatchReader}}, which was originally introduced in ARROW-1012 and now exposed in Python (ARROW-7800), shows some inconsistent behaviour in how the {{batch_size}} is followed across parquet file row groups. 

See also comments at https://github.com/apache/arrow/pull/6979#issuecomment-754672429

Small example with a parquet file of 300 rows consisting of 3 row groups of 100 rows:

{code}
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({'a': range(300)})
pq.write_table(table, "test.parquet", row_group_size=100)
f = pq.ParquetFile("test.parquet")
{code}

When reading this with a {{batch_size}} that doesn't align with the row group size, by default batches that cross the row group boundaries are returned:

{code}
In [5]: [batch.num_rows for batch in f.iter_batches(batch_size=80)]
Out[5]: [80, 80, 80, 60]
{code}
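This default behaviour can be modelled as a plain global counter that ignores row group boundaries entirely. A minimal sketch (not Arrow code; the helper name is made up for illustration):

```python
def split_default(total_rows, batch_size):
    # Default behaviour: cut the file into batch_size chunks globally,
    # ignoring row group boundaries; only the last batch is shorter.
    return [min(batch_size, total_rows - start)
            for start in range(0, total_rows, batch_size)]

print(split_default(300, 80))  # [80, 80, 80, 60]
```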

However, when the file contains a dictionary-typed column with string values (integer dictionary values do not trigger it), the batches follow the row group boundaries:

{code}
import pandas as pd

table = pa.table({'a': pd.Categorical([str(x) for x in range(300)])})
pq.write_table(table, "test.parquet", row_group_size=100)
f = pq.ParquetFile("test.parquet")

In [13]: [batch.num_rows for batch in f.iter_batches(batch_size=80)]
Out[13]: [80, 20, 60, 40, 40, 60]
{code}

However, the {{batch_size}} counter is not reset at the beginning of each row group, so the global batches are merely split at the row group boundaries.
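That interpretation can be checked with a small simulation: keep the global {{batch_size}} cut points from the default behaviour, and additionally force a cut at every row group boundary. A sketch (not Arrow code; the helper name is made up) that reproduces the output above:

```python
def split_at_boundaries(total_rows, batch_size, row_group_size):
    # Global batch_size cut points (counter is NOT reset per row group)...
    cuts = set(range(batch_size, total_rows, batch_size))
    # ...plus an extra cut at every row group boundary.
    cuts |= set(range(row_group_size, total_rows, row_group_size))
    edges = [0] + sorted(cuts) + [total_rows]
    return [b - a for a, b in zip(edges, edges[1:])]

print(split_at_boundaries(300, 80, 100))  # [80, 20, 60, 40, 40, 60]
```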

Additionally, when reading empty batches (an empty column selection), the row group boundaries are also followed, but differently: the batching is done independently for each row group:

{code}
In [14]: [batch.num_rows for batch in f.iter_batches(batch_size=80, columns=[])]
Out[14]: [80, 20, 80, 20, 80, 20]
{code}

(this is explicitly coded here: https://github.com/apache/arrow/blob/e05f032c1e5d590ac56372d13ec637bd28b47a96/cpp/src/parquet/arrow/reader.cc#L899-L921)
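In this third variant the counter effectively restarts at every row group, so each 100-row group is independently chunked into 80 + 20. A sketch of that logic (not Arrow code; the helper name is made up):

```python
def split_per_row_group(total_rows, batch_size, row_group_size):
    # Batch each row group independently: the batch_size counter
    # restarts at every row group boundary.
    out = []
    for start in range(0, total_rows, row_group_size):
        remaining = min(row_group_size, total_rows - start)
        while remaining > 0:
            take = min(batch_size, remaining)
            out.append(take)
            remaining -= take
    return out

print(split_per_row_group(300, 80, 100))  # [80, 20, 80, 20, 80, 20]
```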

---

I don't know what the expected behaviour should be, but I would at least expect it to be consistent across these cases.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)