Posted to dev@parquet.apache.org by Keith Chapman <ke...@gmail.com> on 2017/01/13 00:21:01 UTC

[PARQUET_CPP] Why does the arrow reader in parquet does an extra copy?

Hi,

I'm using the parquet-cpp library to read some Parquet files. I saw that
parquet-cpp has support for Arrow, so I thought I would give it a shot.
When running experiments I did not see any significant increase in
performance, so I took a look at the code. It looks to me like the Arrow
reader uses an intermediate buffer to store the data and hence does an
extra copy. Is this because of the mismatch in data types between Parquet
and Arrow? I'm specifically referring to the
FlatColumnReader::Impl::ReadNullableFlatBatch method in [1] (line 276).
Also, I would imagine that setting one bit at a time is inefficient. I'm
not sure whether the compiler is smart enough to set a word at a time (I
doubt it). I'm just wondering whether there is a reason behind having the
code as it is.

[1]
https://github.com/apache/parquet-cpp/blob/master/src/parquet/arrow/reader.cc


Regards,
Keith.

http://keith-chapman.com
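
To make the bit-setting concern above concrete, here is a minimal C++
sketch. It is not the code from reader.cc; the function names and the use
of std::vector are assumptions made purely for illustration. It builds an
Arrow-style validity bitmap (least-significant bit first) from Parquet
definition levels in two ways: one read-modify-write per value, and one
store per eight values.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of parquet-cpp): touches the bitmap once
// per non-null value, i.e. one read-modify-write per set bit.
void SetBitsOneAtATime(const std::vector<int16_t>& def_levels,
                       int16_t max_def_level, std::vector<uint8_t>* bitmap) {
  assert(bitmap != nullptr);
  bitmap->assign((def_levels.size() + 7) / 8, 0);
  for (std::size_t i = 0; i < def_levels.size(); ++i) {
    if (def_levels[i] == max_def_level) {
      (*bitmap)[i / 8] |= static_cast<uint8_t>(1u << (i % 8));
    }
  }
}

// Hypothetical helper (not part of parquet-cpp): accumulates eight
// validity bits in a local variable and stores each full byte once.
void SetBitsByteAtATime(const std::vector<int16_t>& def_levels,
                        int16_t max_def_level, std::vector<uint8_t>* bitmap) {
  assert(bitmap != nullptr);
  const std::size_t n = def_levels.size();
  bitmap->assign((n + 7) / 8, 0);
  std::size_t i = 0;
  for (std::size_t byte = 0; byte < n / 8; ++byte) {
    uint8_t packed = 0;
    for (int bit = 0; bit < 8; ++bit, ++i) {
      const unsigned valid = (def_levels[i] == max_def_level) ? 1u : 0u;
      packed = static_cast<uint8_t>(packed | (valid << bit));
    }
    (*bitmap)[byte] = packed;
  }
  // Remaining values (fewer than eight) fall back to the per-bit path.
  for (; i < n; ++i) {
    if (def_levels[i] == max_def_level) {
      (*bitmap)[i / 8] |= static_cast<uint8_t>(1u << (i % 8));
    }
  }
}
```

Both functions produce the same bitmap. Whether the second is measurably
faster depends on the compiler and the data, so it would need to be
benchmarked rather than assumed.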

Re: [PARQUET_CPP] Why does the arrow reader in parquet does an extra copy?

Posted by Keith Chapman <ke...@gmail.com>.

Cool, thanks Uwe. Will try using it within the coming days.

Regards,
Keith.

http://keith-chapman.com

On Tue, Jan 17, 2017 at 11:44 PM, Uwe L. Korn <uw...@xhochy.com> wrote:

> Hi Keith,
>
> Just a quick heads-up: the pull request for the read path has been
> merged. I'm currently looking into removing those copies in the write
> path as well.
>
> Cheers
> Uwe

Re: [PARQUET_CPP] Why does the arrow reader in parquet does an extra copy?

Posted by "Uwe L. Korn" <uw...@xhochy.com>.

Hi Keith,

Just a quick heads-up: the pull request for the read path has been merged.
I'm currently looking into removing those copies in the write path as well.

Cheers
Uwe

On Fri, Jan 13, 2017, at 02:20 AM, Keith Chapman wrote:
> Cool, thanks for the update Wes. I was wondering if there was some design
> issue I was not aware of :). I will keep my eyes on the PR and look to
> make more optimizations and upstream them.
> 
> Regards,
> Keith.
> 
> http://keith-chapman.com

Re: [PARQUET_CPP] Why does the arrow reader in parquet does an extra copy?

Posted by Keith Chapman <ke...@gmail.com>.

Cool, thanks for the update Wes. I was wondering if there was some design
issue I was not aware of :). I will keep my eyes on the PR and look to make
more optimizations and upstream them.

Regards,
Keith.

http://keith-chapman.com

On Thu, Jan 12, 2017 at 5:15 PM, Wes McKinney <we...@gmail.com> wrote:

> hi Keith
>
> Uwe is working on this right now (avoiding the extra copy):
>
> https://github.com/apache/parquet-cpp/pull/218
>
> We would appreciate any efforts to further optimize these code paths.
>
> Thanks
> Wes

Re: [PARQUET_CPP] Why does the arrow reader in parquet does an extra copy?

Posted by Wes McKinney <we...@gmail.com>.

hi Keith

Uwe is working on this right now (avoiding the extra copy):

https://github.com/apache/parquet-cpp/pull/218

We would appreciate any efforts to further optimize these code paths.

Thanks
Wes

On Thu, Jan 12, 2017 at 7:21 PM, Keith Chapman <ke...@gmail.com> wrote:
> Hi,
>
> I'm using the parquet-cpp library to read some Parquet files. I saw that
> parquet-cpp has support for Arrow, so I thought I would give it a shot.
> When running experiments I did not see any significant increase in
> performance, so I took a look at the code. It looks to me like the Arrow
> reader uses an intermediate buffer to store the data and hence does an
> extra copy. Is this because of the mismatch in data types between Parquet
> and Arrow? I'm specifically referring to the
> FlatColumnReader::Impl::ReadNullableFlatBatch method in [1] (line 276).
> Also, I would imagine that setting one bit at a time is inefficient. I'm
> not sure whether the compiler is smart enough to set a word at a time (I
> doubt it). I'm just wondering whether there is a reason behind having the
> code as it is.
>
> [1]
> https://github.com/apache/parquet-cpp/blob/master/src/parquet/arrow/reader.cc
>
>
> Regards,
> Keith.
>
> http://keith-chapman.com
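
For readers following the thread, the "extra copy" can be pictured with
the C++ sketch below. It is only an illustration under stated assumptions:
FakeInt32Decoder and the three Read* functions are hypothetical stand-ins,
not the parquet-cpp or Arrow API, and the sketch does not claim to show
what the linked pull request changed. The first function decodes into a
scratch buffer and then copies; the second decodes straight into the
destination, which is possible when the source and destination layouts
match. The third shows a case where a conversion pass is still needed
because the widths differ, as when 8-bit integers are stored with the
INT32 physical type in Parquet.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-in for a column decoder; it just emits 0, 1, 2, ...
struct FakeInt32Decoder {
  int32_t next = 0;
  void Decode(int32_t* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) out[i] = next++;
  }
};

// Two passes over the data: decode into scratch, then copy to dest.
void ReadWithScratch(FakeInt32Decoder* dec, std::size_t n,
                     std::vector<int32_t>* dest) {
  assert(dec != nullptr && dest != nullptr);
  std::vector<int32_t> scratch(n);
  dec->Decode(scratch.data(), n);
  dest->resize(n);
  if (n > 0) {
    std::memcpy(dest->data(), scratch.data(), n * sizeof(int32_t));
  }
}

// One pass: decode directly into the destination buffer.
void ReadDirect(FakeInt32Decoder* dec, std::size_t n,
                std::vector<int32_t>* dest) {
  assert(dec != nullptr && dest != nullptr);
  dest->resize(n);
  dec->Decode(dest->data(), n);
}

// Widths differ (32-bit source, 8-bit destination), so the values must be
// converted one by one and a scratch buffer is still used here.
void ReadNarrowing(FakeInt32Decoder* dec, std::size_t n,
                   std::vector<int8_t>* dest) {
  assert(dec != nullptr && dest != nullptr);
  std::vector<int32_t> scratch(n);
  dec->Decode(scratch.data(), n);
  dest->resize(n);
  for (std::size_t i = 0; i < n; ++i) {
    (*dest)[i] = static_cast<int8_t>(scratch[i]);
  }
}
```

ReadWithScratch and ReadDirect yield identical results; the direct version
simply skips the allocation and the memcpy.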