Posted to user@arrow.apache.org by Zhuo Jia Dai <zh...@gmail.com> on 2019/12/09 02:59:53 UTC

Reading Parquet Files in Chunks?

For example, pandas's read_csv has a chunksize argument, which makes
read_csv return an iterator over the CSV file so we can read it in chunks.

The Parquet format stores the data in chunks, but there isn't a documented
way to read it in chunks the way read_csv does.

Is there a way to read parquet files in chunks?
-- 
ZJ

zhuojia.dai@gmail.com

Re: Reading Parquet Files in Chunks?

Posted by Wes McKinney <we...@gmail.com>.
There is, but it's not exposed in Python yet.

See the "batch_size" parameter of ArrowReaderProperties

https://github.com/apache/arrow/blob/master/cpp/src/parquet/properties.h#L565

and the GetRecordBatchReader method on parquet::arrow::FileReader.
There's some related work happening in the C++ Datasets project
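
Roughly, a sketch of what this looks like with the C++ API (untested; the
exact signatures may differ between Arrow versions, and ReadParquetInBatches
is just an illustrative name):

#include <memory>
#include <numeric>
#include <string>
#include <vector>

#include <arrow/io/file.h>
#include <arrow/record_batch.h>
#include <arrow/status.h>
#include <parquet/arrow/reader.h>
#include <parquet/properties.h>

// Read a Parquet file as a stream of RecordBatches of ~64K rows each.
arrow::Status ReadParquetInBatches(const std::string& path) {
  // Open the file (assumes an Arrow version where Open() returns arrow::Result).
  ARROW_ASSIGN_OR_RAISE(auto infile, arrow::io::ReadableFile::Open(path));

  // batch_size controls how many rows each emitted RecordBatch holds.
  parquet::ArrowReaderProperties props;
  props.set_batch_size(64 * 1024);

  parquet::arrow::FileReaderBuilder builder;
  ARROW_RETURN_NOT_OK(builder.Open(infile));
  builder.properties(props);

  std::unique_ptr<parquet::arrow::FileReader> reader;
  ARROW_RETURN_NOT_OK(builder.Build(&reader));

  // Request all row groups; GetRecordBatchReader streams them as batches.
  std::vector<int> row_groups(reader->num_row_groups());
  std::iota(row_groups.begin(), row_groups.end(), 0);

  std::shared_ptr<arrow::RecordBatchReader> batch_reader;
  ARROW_RETURN_NOT_OK(reader->GetRecordBatchReader(row_groups, &batch_reader));

  std::shared_ptr<arrow::RecordBatch> batch;
  while (true) {
    ARROW_RETURN_NOT_OK(batch_reader->ReadNext(&batch));
    if (batch == nullptr) break;  // end of stream
    // ... process `batch` here ...
  }
  return arrow::Status::OK();
}

Since each RecordBatch holds at most batch_size rows, memory use stays
bounded regardless of how large the file is.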

I'd like to see batch-based reading refined and better documented in both
C++ and Python; this would be a nice project for a volunteer to take on.

- Wes
