Posted to user@arrow.apache.org by Zhuo Jia Dai <zh...@gmail.com> on 2019/12/09 02:59:53 UTC
Reading Parquet Files in Chunks?
For example, pandas's read_csv has a chunksize argument which allows
read_csv to return an iterator over the CSV file so we can read it in chunks.
The Parquet format stores data in chunks, but there isn't a documented
way to read it in chunks the way read_csv does.
Is there a way to read parquet files in chunks?
--
ZJ
zhuojia.dai@gmail.com
Re: Reading Parquet Files in Chunks?
Posted by Wes McKinney <we...@gmail.com>.
There is, but it's not exposed in Python yet.
See the "batch_size" parameter of ArrowReaderProperties:
https://github.com/apache/arrow/blob/master/cpp/src/parquet/properties.h#L565
and the GetRecordBatchReader method on parquet::arrow::FileReader.
There's some related work happening in the C++ Datasets project.
I'd like to see batch-based reading refined and better documented, both
in C++ and Python; this would be a nice project for a volunteer to
take on.
- Wes
On Sun, Dec 8, 2019 at 9:00 PM Zhuo Jia Dai <zh...@gmail.com> wrote:
>
>
> For example, pandas's read_csv has a chunk_size argument which allows the read_csv to return an iterator on the CSV file so we can read it in chunks.
>
> The Parquet format stores the data in chunks, but there isn't a documented way to read in it chunks like read_csv.
>
> Is there a way to read parquet files in chunks?
>
> --
> ZJ
>
> zhuojia.dai@gmail.com