Posted to user@arrow.apache.org by Cheng Su <sc...@gmail.com> on 2022/09/02 18:32:19 UTC

[C++][Python] Recommend way to just read several rows from Parquet

Hello,

I am using PyArrow and encountering an OOM issue when reading a Parquet
file. My end goal is to sample just a few rows (~5 rows) from any Parquet
file, to estimate the in-memory data size of the whole file based on the
sampled rows.

We tried the following approaches:
* `to_batches(batch_size=5)` -
https://arrow.apache.org/docs/python/generated/pyarrow.dataset.FileSystemDataset.html#pyarrow.dataset.FileSystemDataset.to_batches
* `head(num_rows=5, batch_size=5)` -
https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Dataset.html#pyarrow.dataset.Dataset.head
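
For reference, here is roughly what we ran (a minimal sketch;
"data.parquet" is a placeholder for our actual file):

    import pyarrow.dataset as ds

    dataset = ds.dataset("data.parquet", format="parquet")

    # Approach 1: stream record batches of 5 rows each.
    for batch in dataset.to_batches(batch_size=5):
        print(batch.num_rows)
        break

    # Approach 2: read just the first 5 rows.
    table = dataset.head(num_rows=5, batch_size=5)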

But with both approaches, we encountered OOM issues when reading just 5
rows several times from a ~2GB Parquet file. Then we tried
`to_batches(batch_size=100000)`, and it works fine without any OOM issue.

I am confused and want to know: what is the underlying behavior of the
C++ Arrow Parquet reader when batch_size is set to a small value? I guess
there might be some exponential overhead associated with batch_size when
its value is small.

Thanks,
Cheng Su

Re: [C++][Python] Recommend way to just read several rows from Parquet

Posted by Weston Pace <we...@gmail.com>.
Setting the batch size will not have much of an impact on the
amount of memory used.  That is mostly controlled by I/O readahead
(e.g. how many record batches to read at once).  The readahead
settings are not currently exposed in pyarrow, although a PR was
recently merged[1] that should make them available in 10.0.0.
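
Once 10.0.0 is out, that should look something like the following (a
sketch; I believe the exposed parameter names are `batch_readahead` and
`fragment_readahead`, but check the linked PR):

    import pyarrow.dataset as ds

    dataset = ds.dataset("data.parquet", format="parquet")

    # Keep fewer batches/fragments in flight at once
    # (requires pyarrow >= 10.0.0; parameter names as added
    # by the linked PR).
    for batch in dataset.to_batches(batch_size=5,
                                    batch_readahead=1,
                                    fragment_readahead=1):
        break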

OOM when reading a single 2GB parquet file seems kind of extreme.  How
much RAM is available on the system?  Do you know if the parquet file
has some very compressive encodings (e.g. dictionary encoding with
long strings or run-length encoding with long runs)?

> I am confused and want to know: what is the underlying behavior of the C++ Arrow Parquet reader when batch_size is set to a small value?

Basically the readahead tries to keep some number of rows in flight.
If the batches are small then it tries to read lots of batches at once.
If the batches are large then it will only read a few batches at once.
So yes, extremely small batches will incur a lot of overhead, both in
terms of RAM and compute.

> My end goal is to sample just a few rows (~5 rows) from any Parquet file, to estimate the in-memory data size of the whole file based on the sampled rows.

I'm not sure 5 rows will be enough for this.  However, one option
might be to just read in a single row group (assuming the file has
multiple row groups).
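
For example, something like this (a rough sketch, with "data.parquet"
as a placeholder):

    import pyarrow.parquet as pq

    pf = pq.ParquetFile("data.parquet")

    # Read only the first row group into memory.
    table = pf.read_row_group(0)

    # Extrapolate a whole-file in-memory size estimate
    # from that one row group.
    bytes_per_row = table.nbytes / table.num_rows
    estimated_total = bytes_per_row * pf.metadata.num_rows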

One last idea might be to disable pre-buffering.  Pre-buffering is
currently using too much RAM on file reads[2].  You could also try
setting use_legacy_dataset to True.  The legacy reader isn't quite so
aggressive with readahead and might use less RAM.  However, I still
don't think you'll be able to do better than reading a single row
group.
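
Both toggles are reachable through pyarrow.parquet.read_table, e.g.
(again a sketch, "data.parquet" is a placeholder):

    import pyarrow.parquet as pq

    # Disable pre-buffering on the dataset-based reader.
    table = pq.read_table("data.parquet", pre_buffer=False)

    # Or fall back to the legacy reader, which is less
    # aggressive with readahead.
    table = pq.read_table("data.parquet", use_legacy_dataset=True)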

[1] https://github.com/apache/arrow/pull/13799
[2] https://issues.apache.org/jira/browse/ARROW-17599
