Posted to user@arrow.apache.org by Shawn Zeng <xz...@gmail.com> on 2022/02/24 05:07:35 UTC

[Python][Parquet]pq.ParquetFile.read faster than pq.read_table?

Hi all, I found that for the same Parquet file,
using pq.ParquetFile(file_name).read() takes 6s while
pq.read_table(file_name) takes 17s. How do those two APIs differ? I thought
they used the same internals, but it seems they don't. The Parquet file is
865MB, snappy-compressed with dictionary encoding enabled. All other settings
are the defaults, writing with pyarrow.
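
For context, a minimal sketch of how such a comparison can be timed;
'data.parquet' below is a placeholder for the 865MB file described above:

import time
import pyarrow.parquet as pq

path = "data.parquet"  # placeholder for the 865MB snappy-compressed file

start = time.perf_counter()
t1 = pq.ParquetFile(path).read()   # single-file reader API
print("ParquetFile.read:", round(time.perf_counter() - start, 1), "s,",
      t1.num_rows, "rows")

start = time.perf_counter()
t2 = pq.read_table(path)           # convenience function (datasets path by default)
print("read_table:      ", round(time.perf_counter() - start, 1), "s,",
      t2.num_rows, "rows")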

Re: [Python][Parquet]pq.ParquetFile.read faster than pq.read_table?

Posted by Weston Pace <we...@gmail.com>.
The issue was a combination of Python & C++ so it isn't something we'd
see in the micro benchmarks.  In the macro benchmarks this regression
actually did show up pretty clearly [1], but I didn't notice it in the
PR comment that conbench made.  Jonathan Keane raised [2] on the
conbench repo, which proposes more salient reporting of
regressions.  We may also consider reviewing some of the largest
outstanding regressions as we approach a release or as part of the
RC process.

[1] https://conbench.ursa.dev/compare/runs/c4d5e65d088243259e5198f4c0e219c9...5a1c693586c74471b7c8ba775005db54/
[2] https://github.com/conbench/conbench/issues/307

Re: [Python][Parquet]pq.ParquetFile.read faster than pq.read_table?

Posted by Wes McKinney <we...@gmail.com>.
Since this isn't the first time this specific issue has happened in a
major release, is there a way that a test or benchmark regression
check could be introduced to prevent this category of problem in the
future?
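
One possible shape for such a check, as a sketch only: a timed read of a single
large Parquet file through read_table, assuming pytest-benchmark and a
hypothetical fixture file (neither name below comes from the Arrow repo):

import pyarrow.parquet as pq

# Hypothetical macro-benchmark: reading one large, single-row-group file through
# read_table exercises the parallel column-decoding path that regressed in 7.0.0.
LARGE_SINGLE_FILE = "benchmarks/data/large_single_rowgroup.parquet"  # placeholder

def test_read_table_single_file(benchmark):
    # pytest-benchmark's 'benchmark' fixture times the callable over several runs;
    # conbench (or a simple threshold) can then flag commit-to-commit regressions.
    table = benchmark(pq.read_table, LARGE_SINGLE_FILE)
    assert table.num_rows > 0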

Re: [Python][Parquet]pq.ParquetFile.read faster than pq.read_table?

Posted by Weston Pace <we...@gmail.com>.
Thanks for reporting this.  It seems a regression crept into 7.0.0
that accidentally disabled parallel column decoding when
pyarrow.parquet.read_table is called with a single file.  I have filed
[1] and should have a fix for it before the next release.  As a
workaround you can use the datasets API directly; this is already what
pyarrow.parquet.read_table uses under the hood when
use_legacy_dataset=False.  Or you can continue using
use_legacy_dataset=True.

import pyarrow.dataset as ds
table = ds.dataset('file.parquet', format='parquet').to_table()

[1] https://issues.apache.org/jira/browse/ARROW-15784
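
For completeness, the other workaround mentioned above is a single keyword
argument (using the same placeholder file name as the snippet above):

import pyarrow.parquet as pq

# Falls back to the pre-datasets code path, which still decodes columns in parallel.
table = pq.read_table('file.parquet', use_legacy_dataset=True)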

Re: [Python][Parquet]pq.ParquetFile.read faster than pq.read_table?

Posted by Shawn Zeng <xz...@gmail.com>.
I am using a public benchmark. The original file is
https://homepages.cwi.nl/~boncz/PublicBIbenchmark/Generico/Generico_1.csv.bz2
. I used pyarrow version 7.0.0 and the pq.write_table API to write the CSV
file as a Parquet file, with compression=snappy and use_dictionary=True. The
data has ~20M rows and 43 columns, so there is only one row group with the
default row_group_size of 64M. The OS is Ubuntu 20.04 and the file is on a
local disk.
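
A sketch of that conversion, assuming the CSV is '|'-delimited without a header
row (adjust the parse options if the layout differs):

import pyarrow.csv as csv
import pyarrow.parquet as pq

# read_csv decompresses the .bz2 input automatically based on the file extension.
table = csv.read_csv(
    "Generico_1.csv.bz2",
    read_options=csv.ReadOptions(autogenerate_column_names=True),  # assumed: no header row
    parse_options=csv.ParseOptions(delimiter="|"),                 # assumed delimiter
)
pq.write_table(table, "Generico_1.parquet",
               compression="snappy", use_dictionary=True)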

Re: [Python][Parquet]pq.ParquetFile.read faster than pq.read_table?

Posted by Weston Pace <we...@gmail.com>.
That doesn't really solve it, but it does confirm that the problem is in the
newer datasets logic.  I need more information to really know what is going
on, as this still seems like a problem.

How many row groups and how many columns does your file have?  Or do you
have a sample parquet file that shows this issue?
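
For reference, both questions can be answered from the file's metadata without
reading any data; 'data.parquet' below is a placeholder:

import pyarrow.parquet as pq

meta = pq.ParquetFile("data.parquet").metadata  # reads only the footer
print("row groups:", meta.num_row_groups)
print("columns:   ", meta.num_columns)
print("rows:      ", meta.num_rows)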

Re: [Python][Parquet]pq.ParquetFile.read faster than pq.read_table?

Posted by Shawn Zeng <xz...@gmail.com>.
use_legacy_dataset=True fixes the problem. Could you explain a little about
the reason? Thanks!

Re: [Python][Parquet]pq.ParquetFile.read faster than pq.read_table?

Posted by Weston Pace <we...@gmail.com>.
What version of pyarrow are you using?  What's your OS?  Is the file on a
local disk or S3?  How many row groups are in your file?

A difference that large is not expected.  However, the two APIs do use
different infrastructure under the hood.  Do you also get the faster
performance with pq.read_table(use_legacy_dataset=True)?
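
A sketch of that comparison, together with the version information asked for
above ('data.parquet' stands in for the file in question):

import time
import pyarrow as pa
import pyarrow.parquet as pq

print("pyarrow version:", pa.__version__)

path = "data.parquet"  # placeholder
for legacy in (False, True):
    start = time.perf_counter()
    pq.read_table(path, use_legacy_dataset=legacy)
    print("use_legacy_dataset =", legacy, ":",
          round(time.perf_counter() - start, 1), "s")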
