Posted to dev@parquet.apache.org by Keith Chapman <ke...@gmail.com> on 2016/12/21 00:18:53 UTC

[PARQUET_CPP] Reading consecutive columns is inefficient

Hi,

The Java API of ParquetFileReader [1] (line 684) reads a row group as a
whole into memory, while the C++ API reads one column at a time even when
the columns are consecutive. This causes multiple seek and read calls and
can be inefficient when reading over a network. Are there any plans to
extend the C++ API so that it can read a whole row group (only the
relevant columns, as the Java API does) at once?
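To make the difference concrete, here is a minimal C++ sketch (not the
parquet-cpp API; the types and function names below are made up for
illustration) contrasting one seek+read per column chunk with a single
coalesced read covering consecutive chunks of a row group:

#include <cstdint>
#include <cstdio>
#include <vector>

struct ChunkRange {   // byte range of one column chunk within the file
  int64_t offset;
  int64_t length;
};

// Per-column reads: one seek + one read per chunk (what the C++ path
// does today, as described above).
void ReadPerColumn(std::FILE* file, const std::vector<ChunkRange>& chunks) {
  for (const ChunkRange& c : chunks) {
    std::vector<uint8_t> buf(static_cast<size_t>(c.length));
    std::fseek(file, static_cast<long>(c.offset), SEEK_SET);
    std::fread(buf.data(), 1, buf.size(), file);
    // ... hand buf to the corresponding column reader ...
  }
}

// Coalesced read: when the selected chunks are contiguous, issue a single
// seek + read covering the whole span, similar to what parquet-mr does
// when it reads a row group at once.
void ReadCoalesced(std::FILE* file, const std::vector<ChunkRange>& chunks) {
  if (chunks.empty()) return;
  const int64_t start = chunks.front().offset;
  const int64_t end = chunks.back().offset + chunks.back().length;
  std::vector<uint8_t> buf(static_cast<size_t>(end - start));
  std::fseek(file, static_cast<long>(start), SEEK_SET);
  std::fread(buf.data(), 1, buf.size(), file);
  // Each column reader is then given its slice, starting at
  // (chunk.offset - start) within buf, so only one round trip is needed.
}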

Regards,
Keith.

[1]
https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java

Re: [PARQUET_CPP] Reading consecutive columns is inefficient

Posted by Wes McKinney <we...@gmail.com>.
hi Keith,

It seems perfectly reasonable to add configurable read buffering, or an
option to buffer the entire row group if your environment permits it.
Can you create a JIRA about this? We would welcome contributions
around IO tuning for different hardware / network environments.
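As a rough sketch only (the property name and values here are
hypothetical, not an existing parquet-cpp API), such a knob might look
like this:

#include <cstdint>

// Hypothetical buffering option; names are invented for illustration.
struct ReaderBufferingProperties {
  // 0  : no buffering, one seek/read per column chunk (current behavior)
  // >0 : read ahead this many bytes per IO call
  // -1 : buffer the entire row group (selected columns) in one read
  int64_t read_buffer_size = 0;
};

// Sketch of how a reader might pick a strategy from the option.
inline bool BufferWholeRowGroup(const ReaderBufferingProperties& props,
                                int64_t selected_chunks_total_bytes) {
  return props.read_buffer_size < 0 ||
         props.read_buffer_size >= selected_chunks_total_bytes;
}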

Note that in HDFS this is less meaningful, as my understanding is that
the RPC buffer size in the clients is typically 64K.

Thanks,
Wes

On Tue, Dec 20, 2016 at 7:18 PM, Keith Chapman <ke...@gmail.com> wrote:
> Hi,
>
> The Java API of ParquetFileReader [1] (line 684) reads a row group as a
> whole into memory, while the C++ API reads one column at a time even when
> the columns are consecutive. This causes multiple seek and read calls and
> can be inefficient when reading over a network. Are there any plans to
> extend the C++ API so that it can read a whole row group (only the
> relevant columns, as the Java API does) at once?
>
> Regards,
> Keith.
>
> [1]
> https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java