Posted to user@arrow.apache.org by Maarten Ballintijn <ma...@xs4all.nl> on 2019/09/23 22:52:29 UTC

Parquet file reading performance

Greetings,

We have Pandas DataFrames with typically about 6,000 rows using DateTimeIndex.
They have about 20,000 columns with integer column labels, and data with a dtype of float32.

We’d like to store these dataframes with parquet, using the ability to read a subset of columns and to store meta-data with the file.

We’ve found the reading performance to be lower than expected compared to the published benchmarks (e.g. Wes’ blog post).

Using a modified version of his script we did reproduce his results (~1 GB/s for high-entropy data, no dictionary encoding, on a MacBook Pro).
 
But there seem to be three factors that contribute to the slowdown for our datasets:

- A DateTimeIndex is much slower than an int index (we see about a factor of 5).
- The number of columns impacts reading speed significantly (a factor of ~2 going from 16 to 16,000 columns).
- Setting ‘use_pandas_metadata=True’ slows down reading significantly (about 40%) and appears unnecessary.

Are there ways we could speed up the reading? Should we use a different layout?

Thanks for your help and insights!

Cheers,
Maarten 


ps. the routines we used:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

def write_arrow_parquet(df: pd.DataFrame, fname: str) -> None:
    table = pa.Table.from_pandas(df)
    pq.write_table(table, fname, use_dictionary=False, compression=None)

def read_arrow_parquet(fname: str) -> pd.DataFrame:
    table = pq.read_table(fname, use_pandas_metadata=False, use_threads=True)
    df = table.to_pandas()
    return df
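
For completeness, here is a minimal sketch of the column-subset read we are aiming for; the helper name and column names below are only illustrative:

import pandas as pd
import pyarrow.parquet as pq

def read_arrow_parquet_columns(fname: str, columns) -> pd.DataFrame:
    # Hypothetical helper: read only the requested columns instead of all
    # 20,000 (column names are stored as strings in the Parquet schema).
    table = pq.read_table(fname, columns=columns, use_threads=True)
    return table.to_pandas()

# e.g. read_arrow_parquet_columns('testdata.parquet', ['0', '1', '2'])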


Re: Parquet file reading performance

Posted by Wes McKinney <we...@gmail.com>.
On Sat, Sep 28, 2019 at 3:16 PM Maarten Ballintijn <ma...@xs4all.nl> wrote:
>
> Hi Joris,
>
> Thanks for your detailed analysis!
>
> We can leave the impact of the large DateTimeIndex on the performance for another time.
> (Notes: my laptop has sufficient memory to support it, no error is thrown, and the
> resulting DateTimeIndex from the expression is identical to both your version and the other version
> in the test. The large DateTimeIndex is released long before the tests happen, yet it has a
> massive impact? It feels like something is broken.)
>
>
> Thanks for clearly demonstrating that the main issue is with to_pandas().
> That’s very unexpected, in the ’ns’ case I would expect no overhead.
> And even with the ‘us’ case it's only two vector compares and a factor multiply, no?
> Also, Timestamps are quite ubiquitous :-)
>
>
> This leaves me with the following questions:
>
> - Who should I talk to to get this resolved in Pandas?
>
> - Where do I find out more about Parquet v2? And more generally is there an RFC (or similar)
> document that defines the Parquet file format and API?

The one practical barrier to using Parquet V2 natively in Python
is resolving PARQUET-458, i.e. implementing the V2 data page encoding
correctly.

If you write V2 files, you may or may not be able to read them
everywhere. So if you are striving for compatibility across many
processing frameworks I would recommend sticking with V1.
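
For reference, a minimal sketch of how the format version is selected at write time with pyarrow (the data and file names here are placeholders; V1 is the default):

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({'x': [1.0, 2.0, 3.0]})
table = pa.Table.from_pandas(df)

# V1 (the default): broadly readable across processing frameworks
pq.write_table(table, 'data_v1.parquet', version='1.0')

# V2: enables newer features (e.g. nanosecond timestamps), but reader
# support varies, per the caveat above
pq.write_table(table, 'data_v2.parquet', version='2.0')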

For other questions I direct you to dev@parquet.apache.org

> - Do you think it would be possible to take the DateTime column out of Arrow into numpy
> and transform it there to make it more amenable to Pandas? And possibly even do the same for the value columns?
>
> Thanks again and have a great weekend!
> Maarten.
>
>
>
>
> > On Sep 25, 2019, at 11:57 AM, Joris Van den Bossche <jo...@gmail.com> wrote:
> >
> > From looking a little bit further into this, it seems that it is mainly
> > pandas that is slower in creating a Series from an array of datetime64
> > compared to one from an array of ints,
> > and especially if it is not nanosecond resolution:
> >
> > In [29]: a_int = pa.array(np.arange(100000))
> >
> > In [30]: %timeit a_int.to_pandas()
> > 56.7 µs ± 299 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
> >
> > In [31]: a_datetime = pa.array(pd.date_range("2012", periods=100000,
> > freq='S'))
> >
> > In [32]: %timeit a_datetime.to_pandas()
> > 1.94 ms ± 17.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
> >
> > In [33]: a_datetime_us = pa.array(pd.date_range("2012", periods=100000,
> > freq='S'), pa.timestamp('us'))
> >
> > In [34]: %timeit a_datetime_us.to_pandas()
> > 7.78 ms ± 46.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
> >
> > Creating the datetime64 array inside pyarrow is also a bit slower compared
> > to int (causing the slower conversion of a_datetime), but the above
> > difference between nanosecond and microsecond resolution is largely
> > due to pandas, not pyarrow (because pandas needs to convert the
> > microseconds to nanoseconds, and during that conversion will also check
> > that no datetimes were out of bounds for this resolution).
> >
> > And in parquet, the datetime data of the index column will be stored in
> > microsecond resolution (even if the original pandas data was nanosecond
> > resolution). And the slower reading of the parquet file with datetime index
> > is thus almost entirely due to the above difference in timing of converting
> > the int or datetime index column to pandas.
> > Parquet nowadays actually supports storing nanosecond resolution, and this
> > can be triggered in pyarrow by passing version="2.0" to pq.write_table (but
> > the last I heard this version is not yet considered production ready).
> >
> > Joris
> >
> > On Wed, 25 Sep 2019 at 16:03, Joris Van den Bossche <
> > jorisvandenbossche@gmail.com> wrote:
> >
> >> Hi Maarten,
> >>
> >> Thanks for the reproducible script. I ran it on my laptop on pyarrow
> >> master, and not seeing the difference between both datetime indexes:
> >>
> >> Versions:
> >> Python:   3.7.3 | packaged by conda-forge | (default, Mar 27 2019,
> >> 23:01:00)
> >> [GCC 7.3.0] on linux
> >> numpy:    1.16.4
> >> pandas:   0.26.0.dev0+447.gc168ecf26
> >> pyarrow:  0.14.1.dev642+g7f2d637db
> >>
> >> 1073741824 float64 8388608 16
> >> 0: make_dataframe                 :   1443.483 msec,  709 MB/s
> >> 0: write_arrow_parquet            :   7685.426 msec,  133 MB/s
> >> 0: read_arrow_parquet             :   1262.741 msec,  811 MB/s <<<
> >> 1: make_dataframe                 :   1412.575 msec,  725 MB/s
> >> 1: write_arrow_parquet            :   7869.145 msec,  130 MB/s
> >> 1: read_arrow_parquet             :   1947.896 msec,  526 MB/s <<<
> >> 2: make_dataframe                 :   1490.165 msec,  687 MB/s
> >> 2: write_arrow_parquet            :   7040.507 msec,  145 MB/s
> >> 2: read_arrow_parquet             :   1888.316 msec,  542 MB/s <<<
> >>
> >> The only change I needed to make in the script to get it running (within
> >> my memory limits) was the creation of the second DatetimeIndex
> >> (pd.date_range('1970-01-01', '2019-09-01', freq='S') creates an index of
> >> 1.5 billion elements, while only the last part of it is used. So changed
> >> that to index = pd.date_range('2018-01-01', '2019-09-01',
> >> freq='S').array[-rows:])
> >>
> >> The datetime index reading in general is still slower as the int index.
> >> But doing a bit more detailed timings, and it seems this is not due to the
> >> reading of parquet, but the conversion of arrow to pandas (using the files
> >> from the benchmark):
> >>
> >> In [1]: import pyarrow.parquet as pq
> >>
> >> In [4]: %timeit pq.read_table('testdata.int.parquet')
> >> 41.5 ms ± 3.31 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
> >>
> >> In [5]: %timeit pq.read_table('testdata.dt.parquet')
> >> 43 ms ± 1.75 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
> >>
> >> In [6]: table_int = pq.read_table('testdata.int.parquet')
> >>
> >> In [7]: table_datetime = pq.read_table('testdata.dt.parquet')
> >>
> >> In [8]: %timeit table_int.to_pandas()
> >> 14.3 ms ± 309 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
> >>
> >> In [9]: %timeit table_datetime.to_pandas()
> >> 47.2 ms ± 2.53 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
> >>
> >> So you can see that the parquet reading part is basically identical, but
> >> the conversion to pandas is much slower for the datetime-index case.
> >> I will try to look into that code path to see what makes this so much
> >> slower.
> >>
> >> Joris
> >>
> >>
> >> On Tue, 24 Sep 2019 at 22:28, Maarten Ballintijn <ma...@xs4all.nl>
> >> wrote:
> >>
> >>> Hi,
> >>>
> >>> The code to show the performance issue with DateTimeIndex is at:
> >>>
> >>>        https://gist.github.com/maartenb/256556bcd6d7c7636d400f3b464db18c
> >>>
> >>> It shows three cases: 0) int index, 1) datetime index, 2) datetime index
> >>> created in a slightly roundabout way
> >>>
> >>> I’m a little confused by the two date time cases. Case 2) is much slower
> >>> but the df compares identical to case 1)
> >>> (I originally used something like 2) to match our specific data. I don’t
> >>> see why it behaves differently??)
> >>>
> >>> The timings I find are:
> >>>
> >>> 1073741824 float64 8388608 16
> >>> 0: make_dataframe                 :   2390.830 msec,  428 MB/s
> >>> 0: write_arrow_parquet            :   2486.463 msec,  412 MB/s
> >>> 0: read_arrow_parquet             :    813.946 msec,  1258 MB/s <<<
> >>> 1: make_dataframe                 :   2579.815 msec,  397 MB/s
> >>> 1: write_arrow_parquet            :   2708.151 msec,  378 MB/s
> >>> 1: read_arrow_parquet             :   1413.999 msec,  724 MB/s <<<
> >>> 2: make_dataframe                 :  15126.520 msec,  68 MB/s
> >>> 2: write_arrow_parquet            :   9205.815 msec,  111 MB/s
> >>> 2: read_arrow_parquet             :   5929.346 msec,  173 MB/s <<<
> >>>
> >>> Case 0, int index.  This is all great.
> >>> Case 1, datetime index. We lose almost half the speed. Given that a
> >>> datetime only needs to be scaled for Pandas IIRC, that seems like a lot?
> >>> Case 2, the other datetime index. No idea what is going on.
> >>>
> >>> Any insights are much appreciated.
> >>>
> >>> Cheers,
> >>> Maarten.
> >>>
> >>>> On Sep 24, 2019, at 11:25 AM, Wes McKinney <we...@gmail.com> wrote:
> >>>>
> >>>> hi
> >>>>
> >>>> On Tue, Sep 24, 2019 at 9:26 AM Maarten Ballintijn <maartenb@xs4all.nl
> >>> <ma...@xs4all.nl>> wrote:
> >>>>>
> >>>>> Hi Wes,
> >>>>>
> >>>>> Thanks for your quick response.
> >>>>>
> >>>>> Yes, we’re using Python 3.7.4, from miniconda and conda-forge, and:
> >>>>>
> >>>>> numpy:           1.16.5
> >>>>> pandas:          0.25.1
> >>>>> pyarrow:         0.14.1
> >>>>>
> >>>>> It looks like 0.15 is close, so I can wait for that.
> >>>>>
> >>>>> Theoretically I see three components driving the performance:
> >>>>> 1) The cost of locating the column (directory overhead)
> >>>>> 2) The overhead of reading a single column. (reading and processing
> >>> meta data, setting up for reading)
> >>>>> 3) Bulk reading and unmarshalling/decoding the data.
> >>>>>
> >>>>> Only 1) would be impacted by the number of columns, but if you’re
> >>> reading everything ideally this would not be a problem.
> >>>>
> >>>> The problem is more nuanced than that. Parquet's metadata is somewhat
> >>>> "heavy" at the column level. So when you're writing thousands of
> >>>> columns, the fixed overhead associated with reading a single column
> >>>> becomes problematic. There are several data structures associated with
> >>>> decoding a column that have a fixed setup and teardown cost. Even if there
> >>>> is 1 millisecond of fixed overhead related to reading a column (I
> >>>> don't know what the number is exactly) then reading 10,000 columns has
> >>>> 10 seconds of unavoidable overhead. It might be useful for us to
> >>>> quantify and communicate the expected overhead when metadata and
> >>>> decoding is taken into account. Simply put having more than 1000
> >>>> columns is not advisable.
> >>>>
> >>>>> Based on an initial cursory look at the Parquet format I guess the
> >>> index and the column meta-data might need to be read in full so I can see
> >>> how that might slow down reading only a few columns out of a large set. But
> >>> that was not really the case here?
> >>>>>
> >>>>> What would you suggest for looking into the date index slow-down?
> >>>>
> >>>> Can you show a code example to make things easier for us to see what
> >>>> you're seeing?
> >>>>
> >>>>>
> >>>>> Cheers,
> >>>>> Maarten.
> >>>>>
> >>>>>
> >>>>>
> >>>>>> On Sep 23, 2019, at 7:07 PM, Wes McKinney <we...@gmail.com>
> >>> wrote:
> >>>>>>
> >>>>>> hi Maarten,
> >>>>>>
> >>>>>> Are you using the master branch or 0.14.1? There are a number of
> >>>>>> performance regressions in 0.14.0/0.14.1 that are addressed in the
> >>>>>> master branch, to appear as 0.15.0 relatively soon.
> >>>>>>
> >>>>>> As a file format, Parquet (and columnar formats in general) is not
> >>>>>> known to perform well with more than 1000 columns.
> >>>>>>
> >>>>>> On the other items, we'd be happy to work with you to dig through the
> >>>>>> performance issues you're seeing.
> >>>>>>
> >>>>>> Thanks
> >>>>>> Wes
> >>>>>>
> >>>>>> On Mon, Sep 23, 2019 at 5:52 PM Maarten Ballintijn <
> >>> maartenb@xs4all.nl> wrote:
> >>>>>>>
> >>>>>>> Greetings,
> >>>>>>>
> >>>>>>> We have Pandas DataFrames with typically about 6,000 rows using
> >>> DateTimeIndex.
> >>>>>>> They have about 20,000 columns with integer column labels, and data
> >>> with a dtype of float32.
> >>>>>>>
> >>>>>>> We’d like to store these dataframes with parquet, using the ability
> >>> to read a subset of columns and to store meta-data with the file.
> >>>>>>>
> >>>>>>> We’ve found the reading performance less than expected compared to
> >>> the published benchmarks (e.g. Wes’ blog post).
> >>>>>>>
> >>>>>>> Using a modified version of his script we did reproduce his results
> >>> (~ 1GB/s for high entropy, no dict on MacBook pro)
> >>>>>>>
> >>>>>>> But there seem to be three factors that contribute to the slowdown
> >>> for our datasets:
> >>>>>>>
> >>>>>>> - DateTimeIndex is much slower then a Int index (we see about a
> >>> factor 5).
> >>>>>>> - The number of columns impact reading speed significantly (factor
> >>> ~2 going from 16 to 16,000 columns)
> >>>>>>> - The ‘use_pandas_metadata=True’ slows down reading significantly
> >>> and appears unnecessary? (about 40%)
> >>>>>>>
> >>>>>>> Are there ways we could speedup the reading? Should we use a
> >>> different layout?
> >>>>>>>
> >>>>>>> Thanks for your help and insights!
> >>>>>>>
> >>>>>>> Cheers,
> >>>>>>> Maarten
> >>>>>>>
> >>>>>>>
> >>>>>>> ps. the routines we used:
> >>>>>>>
> >>>>>>> def write_arrow_parquet(df: pd.DataFrame, fname: str) -> None:
> >>>>>>>  table = pa.Table.from_pandas(df)
> >>>>>>>  pq.write_table(table, fname, use_dictionary=False,
> >>> compression=None)
> >>>>>>>  return
> >>>>>>>
> >>>>>>> def read_arrow_parquet(fname: str) -> pd.DataFrame:
> >>>>>>>  table = pq.read_table(fname, use_pandas_metadata=False,
> >>> use_threads=True)
> >>>>>>>  df = table.to_pandas()
> >>>>>>>  return df
> >>>
> >>>
>

Re: Parquet file reading performance

Posted by Joris Van den Bossche <jo...@gmail.com>.
On Tue, 1 Oct 2019 at 21:03, Maarten Ballintijn <ma...@xs4all.nl> wrote:

>
> I ran cProfile to understand better what is going on in Pandas. Using your
> code below I find that
> Pandas runs a loop over the generic datetime64 conversion in case the
> datetime64 is not in ’ns’.
> The conversion unpacks the time into a date-time struct and converts the
> date-time struct back
> into a Timestamp. This is clearly slow and unnecessary.
>
> Adding a dedicated us->ns conversion should be possible with much better
> performance.
> (either in Cython or C).
>
> I can add a request for this on Pandas-issues unless someone thinks this
> is not a good idea?
>
Hi Maarten,

Also speaking as a pandas developer, I think it is certainly worth raising
this on the pandas issue tracker.


> Somewhat surprisingly, specifying the dtype on the pd.Series constructor
> triggers another code path
> which is a further factor ~5 slower.
>

Hmm, it seems that in that case it is falling back to `pd.to_datetime`, which
apparently has no special support at all for non-ns datetime64 arrays, and
first converts the data to an object array and then parses it again...
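
A quick sketch to see the two constructor paths side by side (same microsecond data as in the earlier timings; exact numbers will vary):

import numpy as np
import pandas as pd

arr_us = np.asarray(pd.date_range("2012", periods=100000, freq="S")).astype("datetime64[us]")

# Path 1: let pandas infer the dtype (converts us -> ns internally)
s_inferred = pd.Series(arr_us)

# Path 2: passing the dtype explicitly falls back to pd.to_datetime,
# which round-trips through an object array and is much slower
s_explicit = pd.Series(arr_us, dtype="datetime64[ns]")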

Joris

>
>
> > On Oct 1, 2019, at 7:07 AM, Joris Van den Bossche <
> jorisvandenbossche@gmail.com> wrote:
> >
> > Some answers to the other questions:
> >
> > On Sat, 28 Sep 2019 at 22:16, Maarten Ballintijn <ma...@xs4all.nl>
> wrote:
> >
> >> ...
> >> This leaves me with the following questions:
> >>
> >> - Who should I talk to to get this resolved in Pandas?
> >>
> >> You can open an issue on their tracker:
> > https://github.com/pandas-dev/pandas/issues/
> >
> >
> >> - Do you think it would be possible to take the DateTime column out of
> >> Arrow into numpy
> >> and transform it the to make it more amenable to Pandas? and possibly
> even
> >> for the value columns?
> >>
> >
> > The reason that this non-ns column takes more time is due to the
> conversion
> > to nanoseconds.
> > So one way this could be to already cast your arrow table to ns before
> > converting to pandas (this is implemented in arrow, but currently does
> not
> > check of out of bounds values; I opened
> > https://issues.apache.org/jira/browse/ARROW-6704 for this).
> >
> > But, I am not sure it can be improved with a big factor on the pandas
> side.
> > Let's look at some timings using:
> >
> > arr_int = np.arange(100000)
> > arr_ns = np.asarray(pd.date_range("2012", periods=100000, freq='S'))
> > arr_us = arr_ns.astype("datetime64[us]")
> >
> > conversion from those arrays to a pandas Series only takes more time for
> > the microsecond datetimes:
> >
> > In [6]: %timeit pd.Series(arr_int)
> > 47.2 µs ± 2.68 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
> >
> > In [7]: %timeit pd.Series(arr_ns)
> > 27.7 µs ± 2.66 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
> >
> > In [8]: %timeit pd.Series(arr_us)
> > 6.22 ms ± 289 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
> >
> > But just converting microseconds to nanoseconds also already takes
> almost a
> > millisecond on this data (with numpy, but arrow is not faster):
> >
> > In [15]: %timeit arr_us.astype("datetime64[ns]")
> > 742 µs ± 30.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
> >
> > So it should certainly be possible to get those 6ms down to something
> > around 1ms (and which can be a nice improvement! you're welcome to open
> an
> > issue on the pandas issue tracker for it), but it will still be factor of
> > 20 to 30 slower than the nanosecond case.
> >
> > Joris
> >
> >
> >>> On Sep 25, 2019, at 11:57 AM, Joris Van den Bossche <
> >> jorisvandenbossche@gmail.com> wrote:
> >>>
> >>> From looking a little bit further into this, it seems that it is mainly
> >>> pandas who is slower in creating a Series from an array of datetime64
> >>> compared from an array of ints.
> >>> And especially if it is not nanosecond resolution:
> >>>
> >>> In [29]: a_int = pa.array(np.arange(100000))
> >>>
> >>> In [30]: %timeit a_int.to_pandas()
> >>> 56.7 µs ± 299 ns per loop (mean ± std. dev. of 7 runs, 10000 loops
> each)
> >>>
> >>> In [31]: a_datetime = pa.array(pd.date_range("2012", periods=100000,
> >>> freq='S'))
> >>>
> >>> In [32]: %timeit a_datetime.to_pandas()
> >>> 1.94 ms ± 17.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops
> each)
> >>>
> >>> In [33]: a_datetime_us = pa.array(pd.date_range("2012", periods=100000,
> >>> freq='S'), pa.timestamp('us'))
> >>>
> >>> In [34]: %timeit a_datetime_us.to_pandas()
> >>> 7.78 ms ± 46.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
> >>>
> >>> Creating the datetime64 array inside pyarrow is also a bit slower
> >> compared
> >>> to int (causing the slower conversion of a_datetime), but the above
> >>> difference between between nanosecond and microsecond resolution is
> >> largely
> >>> due to pandas, not pyarrow (because pandas needs to convert the
> >>> microseconds to nanoseconds, and during that conversion will also check
> >>> that no datetimes were out of bounds for this resolution).
> >>>
> >>> And in parquet, the datetime data of the index column will be stored in
> >>> microsecond resolution (even if the original pandas data was nanosecond
> >>> resolution). And the slower reading of the parquet file with datetime
> >> index
> >>> is thus almost entirely due to the above difference in timing of
> >> converting
> >>> the int or datetime index column to pandas.
> >>> Parquet nowadays actually supports storing nanosecond resolution, and
> >> this
> >>> can be triggered in pyarrow by passing version="2.0" to pq.write_table
> >> (but
> >>> last what I heard this version is not yet considered production ready).
> >>>
> >>> Joris
> >>>
> >>> On Wed, 25 Sep 2019 at 16:03, Joris Van den Bossche <
> >>> jorisvandenbossche@gmail.com> wrote:
> >>>
> >>>> Hi Maarten,
> >>>>
> >>>> Thanks for the reproducible script. I ran it on my laptop on pyarrow
> >>>> master, and not seeing the difference between both datetime indexes:
> >>>>
> >>>> Versions:
> >>>> Python:   3.7.3 | packaged by conda-forge | (default, Mar 27 2019,
> >>>> 23:01:00)
> >>>> [GCC 7.3.0] on linux
> >>>> numpy:    1.16.4
> >>>> pandas:   0.26.0.dev0+447.gc168ecf26
> >>>> pyarrow:  0.14.1.dev642+g7f2d637db
> >>>>
> >>>> 1073741824 float64 8388608 16
> >>>> 0: make_dataframe                 :   1443.483 msec,  709 MB/s
> >>>> 0: write_arrow_parquet            :   7685.426 msec,  133 MB/s
> >>>> 0: read_arrow_parquet             :   1262.741 msec,  811 MB/s <<<
> >>>> 1: make_dataframe                 :   1412.575 msec,  725 MB/s
> >>>> 1: write_arrow_parquet            :   7869.145 msec,  130 MB/s
> >>>> 1: read_arrow_parquet             :   1947.896 msec,  526 MB/s <<<
> >>>> 2: make_dataframe                 :   1490.165 msec,  687 MB/s
> >>>> 2: write_arrow_parquet            :   7040.507 msec,  145 MB/s
> >>>> 2: read_arrow_parquet             :   1888.316 msec,  542 MB/s <<<
> >>>>
> >>>> The only change I needed to make in the script to get it running
> (within
> >>>> my memory limits) was the creation of the second DatetimeIndex
> >>>> (pd.date_range('1970-01-01', '2019-09-01', freq='S') creates an index
> of
> >>>> 1.5 billion elements, while only the last part of it is used. So
> changed
> >>>> that to index = pd.date_range('2018-01-01', '2019-09-01',
> >>>> freq='S').array[-rows:])
> >>>>
> >>>> The datetime index reading in general is still slower as the int
> index.
> >>>> But doing a bit more detailed timings, and it seems this is not due to
> >> the
> >>>> reading of parquet, but the conversion of arrow to pandas (using the
> >> files
> >>>> from the benchmark):
> >>>>
> >>>> In [1]: import pyarrow.parquet as pq
> >>>>
> >>>> In [4]: %timeit pq.read_table('testdata.int.parquet')
> >>>> 41.5 ms ± 3.31 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
> >>>>
> >>>> In [5]: %timeit pq.read_table('testdata.dt.parquet')
> >>>> 43 ms ± 1.75 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
> >>>>
> >>>> In [6]: table_int = pq.read_table('testdata.int.parquet')
> >>>>
> >>>> In [7]: table_datetime = pq.read_table('testdata.dt.parquet')
> >>>>
> >>>> In [8]: %timeit table_int.to_pandas()
> >>>> 14.3 ms ± 309 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
> >>>>
> >>>> In [9]: %timeit table_datetime.to_pandas()
> >>>> 47.2 ms ± 2.53 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
> >>>>
> >>>> So you can see that the parquet reading part is basically identical,
> but
> >>>> the conversion to pandas is much slower for the datetime-index case.
> >>>> I will try to look into that code path to see what makes this so much
> >>>> slower.
> >>>>
> >>>> Joris
> >>>>
> >>>>
> >>>> On Tue, 24 Sep 2019 at 22:28, Maarten Ballintijn <ma...@xs4all.nl>
> >>>> wrote:
> >>>>
> >>>>> Hi,
> >>>>>
> >>>>> The code to show the performance issue with DateTimeIndex is at:
> >>>>>
> >>>>>
> >> https://gist.github.com/maartenb/256556bcd6d7c7636d400f3b464db18c
> >>>>>
> >>>>> It shows three case 0) int index, 1) datetime index, 2) date time
> index
> >>>>> created in a slightly roundabout way
> >>>>>
> >>>>> I’m a little confused by the two date time cases. Case 2) is much
> >> slower
> >>>>> but the df compares identical to case 1)
> >>>>> (I originally used something like 2) to match our specific data. I
> >> don’t
> >>>>> see why it behaves differently??)
> >>>>>
> >>>>> The timings I find are:
> >>>>>
> >>>>> 1073741824 float64 8388608 16
> >>>>> 0: make_dataframe                 :   2390.830 msec,  428 MB/s
> >>>>> 0: write_arrow_parquet            :   2486.463 msec,  412 MB/s
> >>>>> 0: read_arrow_parquet             :    813.946 msec,  1258 MB/s <<<
> >>>>> 1: make_dataframe                 :   2579.815 msec,  397 MB/s
> >>>>> 1: write_arrow_parquet            :   2708.151 msec,  378 MB/s
> >>>>> 1: read_arrow_parquet             :   1413.999 msec,  724 MB/s <<<
> >>>>> 2: make_dataframe                 :  15126.520 msec,  68 MB/s
> >>>>> 2: write_arrow_parquet            :   9205.815 msec,  111 MB/s
> >>>>> 2: read_arrow_parquet             :   5929.346 msec,  173 MB/s <<<
> >>>>>
> >>>>> Case 0, int index.  This is all great.
> >>>>> Case 1, date time index. We loose almost half the speed. Given that a
> >>>>> datetime is only scaled from Pandas IIRC that seems like a lot?
> >>>>> Case  3, other datetime index. No idea what is going on.
> >>>>>
> >>>>> Any insights are much appreciated.
> >>>>>
> >>>>> Cheers,
> >>>>> Maarten.
> >>>>>
> >>>>>> On Sep 24, 2019, at 11:25 AM, Wes McKinney <we...@gmail.com>
> >> wrote:
> >>>>>>
> >>>>>> hi
> >>>>>>
> >>>>>> On Tue, Sep 24, 2019 at 9:26 AM Maarten Ballintijn <
> >> maartenb@xs4all.nl
> >>>>> <ma...@xs4all.nl>> wrote:
> >>>>>>>
> >>>>>>> Hi Wes,
> >>>>>>>
> >>>>>>> Thanks for your quick response.
> >>>>>>>
> >>>>>>> Yes, we’re using Python 3.7.4, from miniconda and conda-forge, and:
> >>>>>>>
> >>>>>>> numpy:           1.16.5
> >>>>>>> pandas:          0.25.1
> >>>>>>> pyarrow:         0.14.1
> >>>>>>>
> >>>>>>> It looks like 0.15 is close, so I can wait for that.
> >>>>>>>
> >>>>>>> Theoretically I see three components driving the performance:
> >>>>>>> 1) The cost of locating the column (directory overhead)
> >>>>>>> 2) The overhead of reading a single column. (reading and processing
> >>>>> meta data, setting up for reading)
> >>>>>>> 3) Bulk reading and unmarshalling/decoding the data.
> >>>>>>>
> >>>>>>> Only 1) would be impacted by the number of columns, but if you’re
> >>>>> reading everything ideally this would not be a problem.
> >>>>>>
> >>>>>> The problem is more nuanced than that. Parquet's metadata is
> somewhat
> >>>>>> "heavy" at the column level. So when you're writing thousands of
> >>>>>> columns, the fixed overhead associated with reading a single column
> >>>>>> becomes problematic. There are several data structures associated
> with
> >>>>>> decoding a column have a fixed setup and teardown cost. Even if
> there
> >>>>>> is 1 millisecond of fixed overhead related to reading a column (I
> >>>>>> don't know what the number is exactly) then reading 10,000 columns
> has
> >>>>>> 10 seconds of unavoidable overhead. It might be useful for us to
> >>>>>> quantify and communicate the expected overhead when metadata and
> >>>>>> decoding is taken into account. Simply put having more than 1000
> >>>>>> columns is not advisable.
> >>>>>>
> >>>>>>> Based on an initial cursory look at the Parquet format I guess the
> >>>>> index and the column meta-data might need to be read in full so I can
> >> see
> >>>>> how that might slow down reading only a few columns out of a large
> >> set. But
> >>>>> that was not really the case here?
> >>>>>>>
> >>>>>>> What would you suggest for looking into the date index slow-down?
> >>>>>>
> >>>>>> Can you show a code example to make things easier for us to see what
> >>>>>> you're seeing?
> >>>>>>
> >>>>>>>
> >>>>>>> Cheers,
> >>>>>>> Maarten.
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>> On Sep 23, 2019, at 7:07 PM, Wes McKinney <we...@gmail.com>
> >>>>> wrote:
> >>>>>>>>
> >>>>>>>> hi Maarten,
> >>>>>>>>
> >>>>>>>> Are you using the master branch or 0.14.1? There are a number of
> >>>>>>>> performance regressions in 0.14.0/0.14.1 that are addressed in the
> >>>>>>>> master branch, to appear as 0.15.0 relatively soon.
> >>>>>>>>
> >>>>>>>> As a file format, Parquet (and columnar formats in general) is not
> >>>>>>>> known to perform well with more than 1000 columns.
> >>>>>>>>
> >>>>>>>> On the other items, we'd be happy to work with you to dig through
> >> the
> >>>>>>>> performance issues you're seeing.
> >>>>>>>>
> >>>>>>>> Thanks
> >>>>>>>> Wes
> >>>>>>>>
> >>>>>>>> On Mon, Sep 23, 2019 at 5:52 PM Maarten Ballintijn <
> >>>>> maartenb@xs4all.nl> wrote:
> >>>>>>>>>
> >>>>>>>>> Greetings,
> >>>>>>>>>
> >>>>>>>>> We have Pandas DataFrames with typically about 6,000 rows using
> >>>>> DateTimeIndex.
> >>>>>>>>> They have about 20,000 columns with integer column labels, and
> data
> >>>>> with a dtype of float32.
> >>>>>>>>>
> >>>>>>>>> We’d like to store these dataframes with parquet, using the
> ability
> >>>>> to read a subset of columns and to store meta-data with the file.
> >>>>>>>>>
> >>>>>>>>> We’ve found the reading performance less than expected compared
> to
> >>>>> the published benchmarks (e.g. Wes’ blog post).
> >>>>>>>>>
> >>>>>>>>> Using a modified version of his script we did reproduce his
> results
> >>>>> (~ 1GB/s for high entropy, no dict on MacBook pro)
> >>>>>>>>>
> >>>>>>>>> But there seem to be three factors that contribute to the
> slowdown
> >>>>> for our datasets:
> >>>>>>>>>
> >>>>>>>>> - DateTimeIndex is much slower then a Int index (we see about a
> >>>>> factor 5).
> >>>>>>>>> - The number of columns impact reading speed significantly
> (factor
> >>>>> ~2 going from 16 to 16,000 columns)
> >>>>>>>>> - The ‘use_pandas_metadata=True’ slows down reading significantly
> >>>>> and appears unnecessary? (about 40%)
> >>>>>>>>>
> >>>>>>>>> Are there ways we could speedup the reading? Should we use a
> >>>>> different layout?
> >>>>>>>>>
> >>>>>>>>> Thanks for your help and insights!
> >>>>>>>>>
> >>>>>>>>> Cheers,
> >>>>>>>>> Maarten
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> ps. the routines we used:
> >>>>>>>>>
> >>>>>>>>> def write_arrow_parquet(df: pd.DataFrame, fname: str) -> None:
> >>>>>>>>> table = pa.Table.from_pandas(df)
> >>>>>>>>> pq.write_table(table, fname, use_dictionary=False,
> >>>>> compression=None)
> >>>>>>>>> return
> >>>>>>>>>
> >>>>>>>>> def read_arrow_parquet(fname: str) -> pd.DataFrame:
> >>>>>>>>> table = pq.read_table(fname, use_pandas_metadata=False,
> >>>>> use_threads=True)
> >>>>>>>>> df = table.to_pandas()
> >>>>>>>>> return df
> >>>>>
> >>>>>
> >>
> >>
>
>

Re: Parquet file reading performance

Posted by Maarten Ballintijn <ma...@xs4all.nl>.
I ran cProfile to better understand what is going on in Pandas. Using your code below I find that
Pandas runs a loop over the generic datetime64 conversion whenever the datetime64 data is not in ’ns’.
The conversion unpacks each value into a date-time struct and converts the date-time struct back
into a Timestamp. This is clearly slow and unnecessary.
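
(A sketch of that kind of profiling run; the exact statement profiled here is illustrative:)

import cProfile
import pstats
import numpy as np
import pandas as pd

arr_us = np.asarray(pd.date_range("2012", periods=100000, freq="S")).astype("datetime64[us]")

# Profile the slow non-ns Series construction and show the hottest calls.
cProfile.runctx("pd.Series(arr_us)", globals(), locals(), "series_us.prof")
pstats.Stats("series_us.prof").sort_stats("cumulative").print_stats(15)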

Adding a dedicated us->ns conversion (in either Cython or C) with much better performance should be possible.

I can add a request for this on Pandas-issues unless someone thinks this is not a good idea?

Somewhat surprisingly, specifying the dtype on the pd.Series constructor triggers another code path
which is a further factor ~5 slower.
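
For concreteness, a minimal sketch of the dedicated conversion I have in mind (a hypothetical helper, not an existing pandas function; NaT handling is omitted):

import numpy as np
import pandas as pd

def us_to_ns(arr_us: np.ndarray) -> np.ndarray:
    # Two vectorized bound checks plus one scale by 1000, instead of the
    # per-element struct round-trip described above.
    i = arr_us.view("int64")
    lo = -(pd.Timestamp.min.value // -1000)  # ns range expressed in us (ceil)
    hi = pd.Timestamp.max.value // 1000
    if (i < lo).any() or (i > hi).any():
        raise OverflowError("timestamp outside the datetime64[ns] range")
    # For in-range data this matches arr_us.astype("datetime64[ns]").
    return (i * 1000).view("datetime64[ns]")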


> On Oct 1, 2019, at 7:07 AM, Joris Van den Bossche <jo...@gmail.com> wrote:
> 
> Some answers to the other questions:
> 
> On Sat, 28 Sep 2019 at 22:16, Maarten Ballintijn <ma...@xs4all.nl> wrote:
> 
>> ...
>> This leaves me with the following questions:
>> 
>> - Who should I talk to to get this resolved in Pandas?
>> 
>> You can open an issue on their tracker:
> https://github.com/pandas-dev/pandas/issues/
> 
> 
>> - Do you think it would be possible to take the DateTime column out of
>> Arrow into numpy
>> and transform it the to make it more amenable to Pandas? and possibly even
>> for the value columns?
>> 
> 
> The reason that this non-ns column takes more time is due to the conversion
> to nanoseconds.
> So one way this could be to already cast your arrow table to ns before
> converting to pandas (this is implemented in arrow, but currently does not
> check of out of bounds values; I opened
> https://issues.apache.org/jira/browse/ARROW-6704 for this).
> 
> But, I am not sure it can be improved with a big factor on the pandas side.
> Let's look at some timings using:
> 
> arr_int = np.arange(100000)
> arr_ns = np.asarray(pd.date_range("2012", periods=100000, freq='S'))
> arr_us = arr_ns.astype("datetime64[us]")
> 
> conversion from those arrays to a pandas Series only takes more time for
> the microsecond datetimes:
> 
> In [6]: %timeit pd.Series(arr_int)
> 47.2 µs ± 2.68 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
> 
> In [7]: %timeit pd.Series(arr_ns)
> 27.7 µs ± 2.66 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
> 
> In [8]: %timeit pd.Series(arr_us)
> 6.22 ms ± 289 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
> 
> But just converting microseconds to nanoseconds also already takes almost a
> millisecond on this data (with numpy, but arrow is not faster):
> 
> In [15]: %timeit arr_us.astype("datetime64[ns]")
> 742 µs ± 30.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
> 
> So it should certainly be possible to get those 6ms down to something
> around 1ms (and which can be a nice improvement! you're welcome to open an
> issue on the pandas issue tracker for it), but it will still be factor of
> 20 to 30 slower than the nanosecond case.
> 
> Joris
> 
> 
>>> On Sep 25, 2019, at 11:57 AM, Joris Van den Bossche <
>> jorisvandenbossche@gmail.com> wrote:
>>> 
>>> From looking a little bit further into this, it seems that it is mainly
>>> pandas who is slower in creating a Series from an array of datetime64
>>> compared from an array of ints.
>>> And especially if it is not nanosecond resolution:
>>> 
>>> In [29]: a_int = pa.array(np.arange(100000))
>>> 
>>> In [30]: %timeit a_int.to_pandas()
>>> 56.7 µs ± 299 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
>>> 
>>> In [31]: a_datetime = pa.array(pd.date_range("2012", periods=100000,
>>> freq='S'))
>>> 
>>> In [32]: %timeit a_datetime.to_pandas()
>>> 1.94 ms ± 17.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> 
>>> In [33]: a_datetime_us = pa.array(pd.date_range("2012", periods=100000,
>>> freq='S'), pa.timestamp('us'))
>>> 
>>> In [34]: %timeit a_datetime_us.to_pandas()
>>> 7.78 ms ± 46.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> 
>>> Creating the datetime64 array inside pyarrow is also a bit slower
>> compared
>>> to int (causing the slower conversion of a_datetime), but the above
>>> difference between between nanosecond and microsecond resolution is
>> largely
>>> due to pandas, not pyarrow (because pandas needs to convert the
>>> microseconds to nanoseconds, and during that conversion will also check
>>> that no datetimes were out of bounds for this resolution).
>>> 
>>> And in parquet, the datetime data of the index column will be stored in
>>> microsecond resolution (even if the original pandas data was nanosecond
>>> resolution). And the slower reading of the parquet file with datetime
>> index
>>> is thus almost entirely due to the above difference in timing of
>> converting
>>> the int or datetime index column to pandas.
>>> Parquet nowadays actually supports storing nanosecond resolution, and
>> this
>>> can be triggered in pyarrow by passing version="2.0" to pq.write_table
>> (but
>>> last what I heard this version is not yet considered production ready).
>>> 
>>> Joris
>>> 
>>> On Wed, 25 Sep 2019 at 16:03, Joris Van den Bossche <
>>> jorisvandenbossche@gmail.com> wrote:
>>> 
>>>> Hi Maarten,
>>>> 
>>>> Thanks for the reproducible script. I ran it on my laptop on pyarrow
>>>> master, and not seeing the difference between both datetime indexes:
>>>> 
>>>> Versions:
>>>> Python:   3.7.3 | packaged by conda-forge | (default, Mar 27 2019,
>>>> 23:01:00)
>>>> [GCC 7.3.0] on linux
>>>> numpy:    1.16.4
>>>> pandas:   0.26.0.dev0+447.gc168ecf26
>>>> pyarrow:  0.14.1.dev642+g7f2d637db
>>>> 
>>>> 1073741824 float64 8388608 16
>>>> 0: make_dataframe                 :   1443.483 msec,  709 MB/s
>>>> 0: write_arrow_parquet            :   7685.426 msec,  133 MB/s
>>>> 0: read_arrow_parquet             :   1262.741 msec,  811 MB/s <<<
>>>> 1: make_dataframe                 :   1412.575 msec,  725 MB/s
>>>> 1: write_arrow_parquet            :   7869.145 msec,  130 MB/s
>>>> 1: read_arrow_parquet             :   1947.896 msec,  526 MB/s <<<
>>>> 2: make_dataframe                 :   1490.165 msec,  687 MB/s
>>>> 2: write_arrow_parquet            :   7040.507 msec,  145 MB/s
>>>> 2: read_arrow_parquet             :   1888.316 msec,  542 MB/s <<<
>>>> 
>>>> The only change I needed to make in the script to get it running (within
>>>> my memory limits) was the creation of the second DatetimeIndex
>>>> (pd.date_range('1970-01-01', '2019-09-01', freq='S') creates an index of
>>>> 1.5 billion elements, while only the last part of it is used. So changed
>>>> that to index = pd.date_range('2018-01-01', '2019-09-01',
>>>> freq='S').array[-rows:])
>>>> 
>>>> The datetime index reading in general is still slower as the int index.
>>>> But doing a bit more detailed timings, and it seems this is not due to
>> the
>>>> reading of parquet, but the conversion of arrow to pandas (using the
>> files
>>>> from the benchmark):
>>>> 
>>>> In [1]: import pyarrow.parquet as pq
>>>> 
>>>> In [4]: %timeit pq.read_table('testdata.int.parquet')
>>>> 41.5 ms ± 3.31 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>>> 
>>>> In [5]: %timeit pq.read_table('testdata.dt.parquet')
>>>> 43 ms ± 1.75 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>>> 
>>>> In [6]: table_int = pq.read_table('testdata.int.parquet')
>>>> 
>>>> In [7]: table_datetime = pq.read_table('testdata.dt.parquet')
>>>> 
>>>> In [8]: %timeit table_int.to_pandas()
>>>> 14.3 ms ± 309 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>>> 
>>>> In [9]: %timeit table_datetime.to_pandas()
>>>> 47.2 ms ± 2.53 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>>> 
>>>> So you can see that the parquet reading part is basically identical, but
>>>> the conversion to pandas is much slower for the datetime-index case.
>>>> I will try to look into that code path to see what makes this so much
>>>> slower.
>>>> 
>>>> Joris
>>>> 
>>>> 
>>>> On Tue, 24 Sep 2019 at 22:28, Maarten Ballintijn <ma...@xs4all.nl>
>>>> wrote:
>>>> 
>>>>> Hi,
>>>>> 
>>>>> The code to show the performance issue with DateTimeIndex is at:
>>>>> 
>>>>> 
>> https://gist.github.com/maartenb/256556bcd6d7c7636d400f3b464db18c
>>>>> 
>>>>> It shows three case 0) int index, 1) datetime index, 2) date time index
>>>>> created in a slightly roundabout way
>>>>> 
>>>>> I’m a little confused by the two date time cases. Case 2) is much
>> slower
>>>>> but the df compares identical to case 1)
>>>>> (I originally used something like 2) to match our specific data. I
>> don’t
>>>>> see why it behaves differently??)
>>>>> 
>>>>> The timings I find are:
>>>>> 
>>>>> 1073741824 float64 8388608 16
>>>>> 0: make_dataframe                 :   2390.830 msec,  428 MB/s
>>>>> 0: write_arrow_parquet            :   2486.463 msec,  412 MB/s
>>>>> 0: read_arrow_parquet             :    813.946 msec,  1258 MB/s <<<
>>>>> 1: make_dataframe                 :   2579.815 msec,  397 MB/s
>>>>> 1: write_arrow_parquet            :   2708.151 msec,  378 MB/s
>>>>> 1: read_arrow_parquet             :   1413.999 msec,  724 MB/s <<<
>>>>> 2: make_dataframe                 :  15126.520 msec,  68 MB/s
>>>>> 2: write_arrow_parquet            :   9205.815 msec,  111 MB/s
>>>>> 2: read_arrow_parquet             :   5929.346 msec,  173 MB/s <<<
>>>>> 
>>>>> Case 0, int index.  This is all great.
>>>>> Case 1, date time index. We loose almost half the speed. Given that a
>>>>> datetime is only scaled from Pandas IIRC that seems like a lot?
>>>>> Case  3, other datetime index. No idea what is going on.
>>>>> 
>>>>> Any insights are much appreciated.
>>>>> 
>>>>> Cheers,
>>>>> Maarten.
>>>>> 
>>>>>> On Sep 24, 2019, at 11:25 AM, Wes McKinney <we...@gmail.com>
>> wrote:
>>>>>> 
>>>>>> hi
>>>>>> 
>>>>>> On Tue, Sep 24, 2019 at 9:26 AM Maarten Ballintijn <
>> maartenb@xs4all.nl
>>>>> <ma...@xs4all.nl>> wrote:
>>>>>>> 
>>>>>>> Hi Wes,
>>>>>>> 
>>>>>>> Thanks for your quick response.
>>>>>>> 
>>>>>>> Yes, we’re using Python 3.7.4, from miniconda and conda-forge, and:
>>>>>>> 
>>>>>>> numpy:           1.16.5
>>>>>>> pandas:          0.25.1
>>>>>>> pyarrow:         0.14.1
>>>>>>> 
>>>>>>> It looks like 0.15 is close, so I can wait for that.
>>>>>>> 
>>>>>>> Theoretically I see three components driving the performance:
>>>>>>> 1) The cost of locating the column (directory overhead)
>>>>>>> 2) The overhead of reading a single column. (reading and processing
>>>>> meta data, setting up for reading)
>>>>>>> 3) Bulk reading and unmarshalling/decoding the data.
>>>>>>> 
>>>>>>> Only 1) would be impacted by the number of columns, but if you’re
>>>>> reading everything ideally this would not be a problem.
>>>>>> 
>>>>>> The problem is more nuanced than that. Parquet's metadata is somewhat
>>>>>> "heavy" at the column level. So when you're writing thousands of
>>>>>> columns, the fixed overhead associated with reading a single column
>>>>>> becomes problematic. There are several data structures associated with
>>>>>> decoding a column have a fixed setup and teardown cost. Even if there
>>>>>> is 1 millisecond of fixed overhead related to reading a column (I
>>>>>> don't know what the number is exactly) then reading 10,000 columns has
>>>>>> 10 seconds of unavoidable overhead. It might be useful for us to
>>>>>> quantify and communicate the expected overhead when metadata and
>>>>>> decoding is taken into account. Simply put having more than 1000
>>>>>> columns is not advisable.
>>>>>> 
>>>>>>> Based on an initial cursory look at the Parquet format I guess the
>>>>> index and the column meta-data might need to be read in full so I can
>> see
>>>>> how that might slow down reading only a few columns out of a large
>> set. But
>>>>> that was not really the case here?
>>>>>>> 
>>>>>>> What would you suggest for looking into the date index slow-down?
>>>>>> 
>>>>>> Can you show a code example to make things easier for us to see what
>>>>>> you're seeing?
>>>>>> 
>>>>>>> 
>>>>>>> Cheers,
>>>>>>> Maarten.
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>> On Sep 23, 2019, at 7:07 PM, Wes McKinney <we...@gmail.com>
>>>>> wrote:
>>>>>>>> 
>>>>>>>> hi Maarten,
>>>>>>>> 
>>>>>>>> Are you using the master branch or 0.14.1? There are a number of
>>>>>>>> performance regressions in 0.14.0/0.14.1 that are addressed in the
>>>>>>>> master branch, to appear as 0.15.0 relatively soon.
>>>>>>>> 
>>>>>>>> As a file format, Parquet (and columnar formats in general) is not
>>>>>>>> known to perform well with more than 1000 columns.
>>>>>>>> 
>>>>>>>> On the other items, we'd be happy to work with you to dig through
>> the
>>>>>>>> performance issues you're seeing.
>>>>>>>> 
>>>>>>>> Thanks
>>>>>>>> Wes
>>>>>>>> 
>>>>>>>> On Mon, Sep 23, 2019 at 5:52 PM Maarten Ballintijn <
>>>>> maartenb@xs4all.nl> wrote:
>>>>>>>>> 
>>>>>>>>> Greetings,
>>>>>>>>> 
>>>>>>>>> We have Pandas DataFrames with typically about 6,000 rows using
>>>>> DateTimeIndex.
>>>>>>>>> They have about 20,000 columns with integer column labels, and data
>>>>> with a dtype of float32.
>>>>>>>>> 
>>>>>>>>> We’d like to store these dataframes with parquet, using the ability
>>>>> to read a subset of columns and to store meta-data with the file.
>>>>>>>>> 
>>>>>>>>> We’ve found the reading performance less than expected compared to
>>>>> the published benchmarks (e.g. Wes’ blog post).
>>>>>>>>> 
>>>>>>>>> Using a modified version of his script we did reproduce his results
>>>>> (~ 1GB/s for high entropy, no dict on MacBook pro)
>>>>>>>>> 
>>>>>>>>> But there seem to be three factors that contribute to the slowdown
>>>>> for our datasets:
>>>>>>>>> 
>>>>>>>>> - DateTimeIndex is much slower then a Int index (we see about a
>>>>> factor 5).
>>>>>>>>> - The number of columns impact reading speed significantly (factor
>>>>> ~2 going from 16 to 16,000 columns)
>>>>>>>>> - The ‘use_pandas_metadata=True’ slows down reading significantly
>>>>> and appears unnecessary? (about 40%)
>>>>>>>>> 
>>>>>>>>> Are there ways we could speedup the reading? Should we use a
>>>>> different layout?
>>>>>>>>> 
>>>>>>>>> Thanks for your help and insights!
>>>>>>>>> 
>>>>>>>>> Cheers,
>>>>>>>>> Maarten
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> ps. the routines we used:
>>>>>>>>> 
>>>>>>>>> def write_arrow_parquet(df: pd.DataFrame, fname: str) -> None:
>>>>>>>>> table = pa.Table.from_pandas(df)
>>>>>>>>> pq.write_table(table, fname, use_dictionary=False,
>>>>> compression=None)
>>>>>>>>> return
>>>>>>>>> 
>>>>>>>>> def read_arrow_parquet(fname: str) -> pd.DataFrame:
>>>>>>>>> table = pq.read_table(fname, use_pandas_metadata=False,
>>>>> use_threads=True)
>>>>>>>>> df = table.to_pandas()
>>>>>>>>> return df
>>>>> 
>>>>> 
>> 
>> 


Re: Parquet file reading performance

Posted by Joris Van den Bossche <jo...@gmail.com>.
Some answers to the other questions:

On Sat, 28 Sep 2019 at 22:16, Maarten Ballintijn <ma...@xs4all.nl> wrote:

> ...
> This leaves me with the following questions:
>
> - Who should I talk to to get this resolved in Pandas?
>
> You can open an issue on their tracker:
https://github.com/pandas-dev/pandas/issues/


> - Do you think it would be possible to take the DateTime column out of
> Arrow into numpy
> and transform it the to make it more amenable to Pandas? and possibly even
> for the value columns?
>

The reason that this non-ns column takes more time is the conversion
to nanoseconds.
So one workaround could be to cast your arrow table to ns before
converting to pandas (this is implemented in arrow, but currently does not
check for out-of-bounds values; I opened
https://issues.apache.org/jira/browse/ARROW-6704 for this).
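
A rough sketch of that workaround at the array level (illustrative only, given the missing bounds check mentioned above):

import pandas as pd
import pyarrow as pa

a_us = pa.array(pd.date_range("2012", periods=100000, freq="S"), pa.timestamp("us"))

# Cast to nanoseconds on the Arrow side first, so the conversion to pandas
# receives ns data directly and skips the slow us -> ns path in pandas.
a_ns = a_us.cast(pa.timestamp("ns"))
result = a_ns.to_pandas()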

But I am not sure it can be improved by a big factor on the pandas side.
Let's look at some timings using:

arr_int = np.arange(100000)
arr_ns = np.asarray(pd.date_range("2012", periods=100000, freq='S'))
arr_us = arr_ns.astype("datetime64[us]")

conversion from those arrays to a pandas Series only takes more time for
the microsecond datetimes:

In [6]: %timeit pd.Series(arr_int)
47.2 µs ± 2.68 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [7]: %timeit pd.Series(arr_ns)
27.7 µs ± 2.66 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [8]: %timeit pd.Series(arr_us)
6.22 ms ± 289 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

But just converting microseconds to nanoseconds already takes almost a
millisecond on this data (with numpy; arrow is not faster):

In [15]: %timeit arr_us.astype("datetime64[ns]")
742 µs ± 30.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

So it should certainly be possible to get those 6 ms down to something
around 1 ms (which would be a nice improvement! You're welcome to open an
issue on the pandas issue tracker for it), but it will still be a factor of
20 to 30 slower than the nanosecond case.

Joris


> > On Sep 25, 2019, at 11:57 AM, Joris Van den Bossche <
> jorisvandenbossche@gmail.com> wrote:
> >
> > From looking a little bit further into this, it seems that it is mainly
> > pandas who is slower in creating a Series from an array of datetime64
> > compared from an array of ints.
> > And especially if it is not nanosecond resolution:
> >
> > In [29]: a_int = pa.array(np.arange(100000))
> >
> > In [30]: %timeit a_int.to_pandas()
> > 56.7 µs ± 299 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
> >
> > In [31]: a_datetime = pa.array(pd.date_range("2012", periods=100000,
> > freq='S'))
> >
> > In [32]: %timeit a_datetime.to_pandas()
> > 1.94 ms ± 17.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
> >
> > In [33]: a_datetime_us = pa.array(pd.date_range("2012", periods=100000,
> > freq='S'), pa.timestamp('us'))
> >
> > In [34]: %timeit a_datetime_us.to_pandas()
> > 7.78 ms ± 46.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
> >
> > Creating the datetime64 array inside pyarrow is also a bit slower
> compared
> > to int (causing the slower conversion of a_datetime), but the above
> > difference between between nanosecond and microsecond resolution is
> largely
> > due to pandas, not pyarrow (because pandas needs to convert the
> > microseconds to nanoseconds, and during that conversion will also check
> > that no datetimes were out of bounds for this resolution).
> >
> > And in parquet, the datetime data of the index column will be stored in
> > microsecond resolution (even if the original pandas data was nanosecond
> > resolution). And the slower reading of the parquet file with datetime
> index
> > is thus almost entirely due to the above difference in timing of
> converting
> > the int or datetime index column to pandas.
> > Parquet nowadays actually supports storing nanosecond resolution, and
> this
> > can be triggered in pyarrow by passing version="2.0" to pq.write_table
> (but
> > last what I heard this version is not yet considered production ready).
> >
> > Joris
> >
> > On Wed, 25 Sep 2019 at 16:03, Joris Van den Bossche <
> > jorisvandenbossche@gmail.com> wrote:
> >
> >> Hi Maarten,
> >>
> >> Thanks for the reproducible script. I ran it on my laptop on pyarrow
> >> master, and not seeing the difference between both datetime indexes:
> >>
> >> Versions:
> >> Python:   3.7.3 | packaged by conda-forge | (default, Mar 27 2019,
> >> 23:01:00)
> >> [GCC 7.3.0] on linux
> >> numpy:    1.16.4
> >> pandas:   0.26.0.dev0+447.gc168ecf26
> >> pyarrow:  0.14.1.dev642+g7f2d637db
> >>
> >> 1073741824 float64 8388608 16
> >> 0: make_dataframe                 :   1443.483 msec,  709 MB/s
> >> 0: write_arrow_parquet            :   7685.426 msec,  133 MB/s
> >> 0: read_arrow_parquet             :   1262.741 msec,  811 MB/s <<<
> >> 1: make_dataframe                 :   1412.575 msec,  725 MB/s
> >> 1: write_arrow_parquet            :   7869.145 msec,  130 MB/s
> >> 1: read_arrow_parquet             :   1947.896 msec,  526 MB/s <<<
> >> 2: make_dataframe                 :   1490.165 msec,  687 MB/s
> >> 2: write_arrow_parquet            :   7040.507 msec,  145 MB/s
> >> 2: read_arrow_parquet             :   1888.316 msec,  542 MB/s <<<
> >>
> >> The only change I needed to make in the script to get it running (within
> >> my memory limits) was the creation of the second DatetimeIndex
> >> (pd.date_range('1970-01-01', '2019-09-01', freq='S') creates an index of
> >> 1.5 billion elements, while only the last part of it is used. So changed
> >> that to index = pd.date_range('2018-01-01', '2019-09-01',
> >> freq='S').array[-rows:])
> >>
> >> The datetime index reading in general is still slower as the int index.
> >> But doing a bit more detailed timings, and it seems this is not due to
> the
> >> reading of parquet, but the conversion of arrow to pandas (using the
> files
> >> from the benchmark):
> >>
> >> In [1]: import pyarrow.parquet as pq
> >>
> >> In [4]: %timeit pq.read_table('testdata.int.parquet')
> >> 41.5 ms ± 3.31 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
> >>
> >> In [5]: %timeit pq.read_table('testdata.dt.parquet')
> >> 43 ms ± 1.75 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
> >>
> >> In [6]: table_int = pq.read_table('testdata.int.parquet')
> >>
> >> In [7]: table_datetime = pq.read_table('testdata.dt.parquet')
> >>
> >> In [8]: %timeit table_int.to_pandas()
> >> 14.3 ms ± 309 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
> >>
> >> In [9]: %timeit table_datetime.to_pandas()
> >> 47.2 ms ± 2.53 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
> >>
> >> So you can see that the parquet reading part is basically identical, but
> >> the conversion to pandas is much slower for the datetime-index case.
> >> I will try to look into that code path to see what makes this so much
> >> slower.
> >>
> >> Joris
> >>
> >>
> >> On Tue, 24 Sep 2019 at 22:28, Maarten Ballintijn <ma...@xs4all.nl>
> >> wrote:
> >>
> >>> Hi,
> >>>
> >>> The code to show the performance issue with DateTimeIndex is at:
> >>>
> >>>
> https://gist.github.com/maartenb/256556bcd6d7c7636d400f3b464db18c
> >>>
> >>> It shows three case 0) int index, 1) datetime index, 2) date time index
> >>> created in a slightly roundabout way
> >>>
> >>> I’m a little confused by the two date time cases. Case 2) is much
> slower
> >>> but the df compares identical to case 1)
> >>> (I originally used something like 2) to match our specific data. I
> don’t
> >>> see why it behaves differently??)
> >>>
> >>> The timings I find are:
> >>>
> >>> 1073741824 float64 8388608 16
> >>> 0: make_dataframe                 :   2390.830 msec,  428 MB/s
> >>> 0: write_arrow_parquet            :   2486.463 msec,  412 MB/s
> >>> 0: read_arrow_parquet             :    813.946 msec,  1258 MB/s <<<
> >>> 1: make_dataframe                 :   2579.815 msec,  397 MB/s
> >>> 1: write_arrow_parquet            :   2708.151 msec,  378 MB/s
> >>> 1: read_arrow_parquet             :   1413.999 msec,  724 MB/s <<<
> >>> 2: make_dataframe                 :  15126.520 msec,  68 MB/s
> >>> 2: write_arrow_parquet            :   9205.815 msec,  111 MB/s
> >>> 2: read_arrow_parquet             :   5929.346 msec,  173 MB/s <<<
> >>>
> >>> Case 0, int index.  This is all great.
> >>> Case 1, date time index. We loose almost half the speed. Given that a
> >>> datetime is only scaled from Pandas IIRC that seems like a lot?
> >>> Case  3, other datetime index. No idea what is going on.
> >>>
> >>> Any insights are much appreciated.
> >>>
> >>> Cheers,
> >>> Maarten.
> >>>
> >>>> On Sep 24, 2019, at 11:25 AM, Wes McKinney <we...@gmail.com>
> wrote:
> >>>>
> >>>> hi
> >>>>
> >>>> On Tue, Sep 24, 2019 at 9:26 AM Maarten Ballintijn <
> maartenb@xs4all.nl
> >>> <ma...@xs4all.nl>> wrote:
> >>>>>
> >>>>> Hi Wes,
> >>>>>
> >>>>> Thanks for your quick response.
> >>>>>
> >>>>> Yes, we’re using Python 3.7.4, from miniconda and conda-forge, and:
> >>>>>
> >>>>> numpy:           1.16.5
> >>>>> pandas:          0.25.1
> >>>>> pyarrow:         0.14.1
> >>>>>
> >>>>> It looks like 0.15 is close, so I can wait for that.
> >>>>>
> >>>>> Theoretically I see three components driving the performance:
> >>>>> 1) The cost of locating the column (directory overhead)
> >>>>> 2) The overhead of reading a single column. (reading and processing
> >>> meta data, setting up for reading)
> >>>>> 3) Bulk reading and unmarshalling/decoding the data.
> >>>>>
> >>>>> Only 1) would be impacted by the number of columns, but if you’re
> >>> reading everything ideally this would not be a problem.
> >>>>
> >>>> The problem is more nuanced than that. Parquet's metadata is somewhat
> >>>> "heavy" at the column level. So when you're writing thousands of
> >>>> columns, the fixed overhead associated with reading a single column
> >>>> becomes problematic. There are several data structures associated with
> >>>> decoding a column have a fixed setup and teardown cost. Even if there
> >>>> is 1 millisecond of fixed overhead related to reading a column (I
> >>>> don't know what the number is exactly) then reading 10,000 columns has
> >>>> 10 seconds of unavoidable overhead. It might be useful for us to
> >>>> quantify and communicate the expected overhead when metadata and
> >>>> decoding is taken into account. Simply put having more than 1000
> >>>> columns is not advisable.
> >>>>
> >>>>> Based on an initial cursory look at the Parquet format I guess the
> >>> index and the column meta-data might need to be read in full so I can
> see
> >>> how that might slow down reading only a few columns out of a large
> set. But
> >>> that was not really the case here?
> >>>>>
> >>>>> What would you suggest for looking into the date index slow-down?
> >>>>
> >>>> Can you show a code example to make things easier for us to see what
> >>>> you're seeing?
> >>>>
> >>>>>
> >>>>> Cheers,
> >>>>> Maarten.
> >>>>>
> >>>>>
> >>>>>
> >>>>>> On Sep 23, 2019, at 7:07 PM, Wes McKinney <we...@gmail.com>
> >>> wrote:
> >>>>>>
> >>>>>> hi Maarten,
> >>>>>>
> >>>>>> Are you using the master branch or 0.14.1? There are a number of
> >>>>>> performance regressions in 0.14.0/0.14.1 that are addressed in the
> >>>>>> master branch, to appear as 0.15.0 relatively soon.
> >>>>>>
> >>>>>> As a file format, Parquet (and columnar formats in general) is not
> >>>>>> known to perform well with more than 1000 columns.
> >>>>>>
> >>>>>> On the other items, we'd be happy to work with you to dig through
> the
> >>>>>> performance issues you're seeing.
> >>>>>>
> >>>>>> Thanks
> >>>>>> Wes
> >>>>>>
> >>>>>> On Mon, Sep 23, 2019 at 5:52 PM Maarten Ballintijn <
> >>> maartenb@xs4all.nl> wrote:
> >>>>>>>
> >>>>>>> Greetings,
> >>>>>>>
> >>>>>>> We have Pandas DataFrames with typically about 6,000 rows using
> >>> DateTimeIndex.
> >>>>>>> They have about 20,000 columns with integer column labels, and data
> >>> with a dtype of float32.
> >>>>>>>
> >>>>>>> We’d like to store these dataframes with parquet, using the ability
> >>> to read a subset of columns and to store meta-data with the file.
> >>>>>>>
> >>>>>>> We’ve found the reading performance less than expected compared to
> >>> the published benchmarks (e.g. Wes’ blog post).
> >>>>>>>
> >>>>>>> Using a modified version of his script we did reproduce his results
> >>> (~ 1GB/s for high entropy, no dict on MacBook pro)
> >>>>>>>
> >>>>>>> But there seem to be three factors that contribute to the slowdown
> >>> for our datasets:
> >>>>>>>
> >>>>>>> - DateTimeIndex is much slower than an Int index (we see about a
> >>> factor 5).
> >>>>>>> - The number of columns impacts reading speed significantly (factor
> >>> ~2 going from 16 to 16,000 columns)
> >>>>>>> - The ‘use_pandas_metadata=True’ slows down reading significantly
> >>> and appears unnecessary? (about 40%)
> >>>>>>>
> >>>>>>> Are there ways we could speedup the reading? Should we use a
> >>> different layout?
> >>>>>>>
> >>>>>>> Thanks for your help and insights!
> >>>>>>>
> >>>>>>> Cheers,
> >>>>>>> Maarten
> >>>>>>>
> >>>>>>>
> >>>>>>> ps. the routines we used:
> >>>>>>>
> >>>>>>> def write_arrow_parquet(df: pd.DataFrame, fname: str) -> None:
> >>>>>>>  table = pa.Table.from_pandas(df)
> >>>>>>>  pq.write_table(table, fname, use_dictionary=False,
> >>> compression=None)
> >>>>>>>  return
> >>>>>>>
> >>>>>>> def read_arrow_parquet(fname: str) -> pd.DataFrame:
> >>>>>>>  table = pq.read_table(fname, use_pandas_metadata=False,
> >>> use_threads=True)
> >>>>>>>  df = table.to_pandas()
> >>>>>>>  return df
> >>>
> >>>
>
>

Re: Parquet file reading performance

Posted by Maarten Ballintijn <ma...@xs4all.nl>.
Hi Joris, 

Thanks for your detailed analysis!

We can leave the impact of the large DateTimeIndex on the performance for another time.
(Notes: my laptop has sufficient memory to support it, no error is thrown, the
resulting DateTimeIndex from the expression is identical to your version or the other version
in the test. The large DateTimeIndex is released long before the tests happen, yet it has a
massive impact?? It feels like something is broken)


Thanks for clearly demonstrating that the main issue is with to_pandas().
That’s very unexpected; in the ’ns’ case I would expect no overhead.
And even with the ‘us’ case it's only two vector compares and a factor multiply, no?
Also, Timestamps are quite ubiquitous :-)
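
(For illustration only, this is roughly the amount of work I would expect the ‘us’ case
to need, sketched in plain numpy; the names here are mine, not pandas internals:)

import numpy as np

us = np.arange(100_000, dtype="int64")        # microseconds since the epoch
bound = np.iinfo("int64").max // 1000         # largest value representable as ns
in_range = (us <= bound) & (us >= -bound)     # the two vector compares
assert in_range.all()
ns = (us * 1000).view("M8[ns]")               # one multiply plus a reinterpret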


This leaves me with the following questions:

- Who should I talk to to get this resolved in Pandas?

- Where do I find out more about Parquet v2? And more generally is there an RFC (or similar)
document that defines the Parquet file format and API?

- Do you think it would be possible to take the DateTime column out of Arrow into numpy
and transform it to make it more amenable to Pandas? And possibly even for the value columns?
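
(To make the question concrete, a rough sketch of the kind of workaround I have in mind:
cast the index column to nanoseconds on the Arrow side before handing anything to pandas.
The column name 'index', and a recent pyarrow where the column object has a cast() method,
are assumptions on my part, not a tested recipe:)

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

def read_with_ns_index(fname: str, index_name: str = "index") -> pd.DataFrame:
    table = pq.read_table(fname, use_threads=True)
    i = table.schema.get_field_index(index_name)
    # Cast the stored microsecond timestamps back to nanoseconds in Arrow,
    # so pandas does not have to up-convert and bounds-check every value.
    idx = table.column(i).cast(pa.timestamp("ns"))
    df = table.drop([index_name]).to_pandas()
    df.index = pd.DatetimeIndex(idx.to_pandas())
    return df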

Thanks again and have a great weekend!
Maarten.




> On Sep 25, 2019, at 11:57 AM, Joris Van den Bossche <jo...@gmail.com> wrote:
> 
> From looking a little bit further into this, it seems that it is mainly
> pandas that is slower in creating a Series from an array of datetime64
> compared to an array of ints.
> And especially if it is not nanosecond resolution:
> 
> In [29]: a_int = pa.array(np.arange(100000))
> 
> In [30]: %timeit a_int.to_pandas()
> 56.7 µs ± 299 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
> 
> In [31]: a_datetime = pa.array(pd.date_range("2012", periods=100000,
> freq='S'))
> 
> In [32]: %timeit a_datetime.to_pandas()
> 1.94 ms ± 17.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
> 
> In [33]: a_datetime_us = pa.array(pd.date_range("2012", periods=100000,
> freq='S'), pa.timestamp('us'))
> 
> In [34]: %timeit a_datetime_us.to_pandas()
> 7.78 ms ± 46.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
> 
> Creating the datetime64 array inside pyarrow is also a bit slower compared
> to int (causing the slower conversion of a_datetime), but the above
> difference between nanosecond and microsecond resolution is largely
> due to pandas, not pyarrow (because pandas needs to convert the
> microseconds to nanoseconds, and during that conversion will also check
> that no datetimes were out of bounds for this resolution).
> 
> And in parquet, the datetime data of the index column will be stored in
> microsecond resolution (even if the original pandas data was nanosecond
> resolution). And the slower reading of the parquet file with datetime index
> is thus almost entirely due to the above difference in timing of converting
> the int or datetime index column to pandas.
> Parquet nowadays actually supports storing nanosecond resolution, and this
> can be triggered in pyarrow by passing version="2.0" to pq.write_table (but
> the last I heard, this version is not yet considered production ready).
> 
> Joris
> 
> On Wed, 25 Sep 2019 at 16:03, Joris Van den Bossche <
> jorisvandenbossche@gmail.com> wrote:
> 
>> Hi Maarten,
>> 
>> Thanks for the reproducible script. I ran it on my laptop on pyarrow
>> master, and not seeing the difference between both datetime indexes:
>> 
>> Versions:
>> Python:   3.7.3 | packaged by conda-forge | (default, Mar 27 2019,
>> 23:01:00)
>> [GCC 7.3.0] on linux
>> numpy:    1.16.4
>> pandas:   0.26.0.dev0+447.gc168ecf26
>> pyarrow:  0.14.1.dev642+g7f2d637db
>> 
>> 1073741824 float64 8388608 16
>> 0: make_dataframe                 :   1443.483 msec,  709 MB/s
>> 0: write_arrow_parquet            :   7685.426 msec,  133 MB/s
>> 0: read_arrow_parquet             :   1262.741 msec,  811 MB/s <<<
>> 1: make_dataframe                 :   1412.575 msec,  725 MB/s
>> 1: write_arrow_parquet            :   7869.145 msec,  130 MB/s
>> 1: read_arrow_parquet             :   1947.896 msec,  526 MB/s <<<
>> 2: make_dataframe                 :   1490.165 msec,  687 MB/s
>> 2: write_arrow_parquet            :   7040.507 msec,  145 MB/s
>> 2: read_arrow_parquet             :   1888.316 msec,  542 MB/s <<<
>> 
>> The only change I needed to make in the script to get it running (within
>> my memory limits) was the creation of the second DatetimeIndex
>> (pd.date_range('1970-01-01', '2019-09-01', freq='S') creates an index of
>> 1.5 billion elements, while only the last part of it is used. So changed
>> that to index = pd.date_range('2018-01-01', '2019-09-01',
>> freq='S').array[-rows:])
>> 
>> The datetime index reading in general is still slower than the int index.
>> But doing a bit more detailed timings, it seems this is not due to the
>> reading of parquet, but the conversion of arrow to pandas (using the files
>> from the benchmark):
>> 
>> In [1]: import pyarrow.parquet as pq
>> 
>> In [4]: %timeit pq.read_table('testdata.int.parquet')
>> 41.5 ms ± 3.31 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>> 
>> In [5]: %timeit pq.read_table('testdata.dt.parquet')
>> 43 ms ± 1.75 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>> 
>> In [6]: table_int = pq.read_table('testdata.int.parquet')
>> 
>> In [7]: table_datetime = pq.read_table('testdata.dt.parquet')
>> 
>> In [8]: %timeit table_int.to_pandas()
>> 14.3 ms ± 309 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>> 
>> In [9]: %timeit table_datetime.to_pandas()
>> 47.2 ms ± 2.53 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>> 
>> So you can see that the parquet reading part is basically identical, but
>> the conversion to pandas is much slower for the datetime-index case.
>> I will try to look into that code path to see what makes this so much
>> slower.
>> 
>> Joris
>> 
>> 
>> On Tue, 24 Sep 2019 at 22:28, Maarten Ballintijn <ma...@xs4all.nl>
>> wrote:
>> 
>>> Hi,
>>> 
>>> The code to show the performance issue with DateTimeIndex is at:
>>> 
>>>        https://gist.github.com/maartenb/256556bcd6d7c7636d400f3b464db18c
>>> 
>>> It shows three cases: 0) int index, 1) datetime index, 2) datetime index
>>> created in a slightly roundabout way
>>> 
>>> I’m a little confused by the two date time cases. Case 2) is much slower
>>> but the df compares identical to case 1)
>>> (I originally used something like 2) to match our specific data. I don’t
>>> see why it behaves differently??)
>>> 
>>> The timings I find are:
>>> 
>>> 1073741824 float64 8388608 16
>>> 0: make_dataframe                 :   2390.830 msec,  428 MB/s
>>> 0: write_arrow_parquet            :   2486.463 msec,  412 MB/s
>>> 0: read_arrow_parquet             :    813.946 msec,  1258 MB/s <<<
>>> 1: make_dataframe                 :   2579.815 msec,  397 MB/s
>>> 1: write_arrow_parquet            :   2708.151 msec,  378 MB/s
>>> 1: read_arrow_parquet             :   1413.999 msec,  724 MB/s <<<
>>> 2: make_dataframe                 :  15126.520 msec,  68 MB/s
>>> 2: write_arrow_parquet            :   9205.815 msec,  111 MB/s
>>> 2: read_arrow_parquet             :   5929.346 msec,  173 MB/s <<<
>>> 
>>> Case 0, int index.  This is all great.
>>> Case 1, date time index. We lose almost half the speed. Given that a
>>> datetime is only scaled from Pandas IIRC that seems like a lot?
>>> Case 2, other datetime index. No idea what is going on.
>>> 
>>> Any insights are much appreciated.
>>> 
>>> Cheers,
>>> Maarten.
>>> 
>>>> On Sep 24, 2019, at 11:25 AM, Wes McKinney <we...@gmail.com> wrote:
>>>> 
>>>> hi
>>>> 
>>>> On Tue, Sep 24, 2019 at 9:26 AM Maarten Ballintijn <maartenb@xs4all.nl
>>> <ma...@xs4all.nl>> wrote:
>>>>> 
>>>>> Hi Wes,
>>>>> 
>>>>> Thanks for your quick response.
>>>>> 
>>>>> Yes, we’re using Python 3.7.4, from miniconda and conda-forge, and:
>>>>> 
>>>>> numpy:           1.16.5
>>>>> pandas:          0.25.1
>>>>> pyarrow:         0.14.1
>>>>> 
>>>>> It looks like 0.15 is close, so I can wait for that.
>>>>> 
>>>>> Theoretically I see three components driving the performance:
>>>>> 1) The cost of locating the column (directory overhead)
>>>>> 2) The overhead of reading a single column. (reading and processing
>>> meta data, setting up for reading)
>>>>> 3) Bulk reading and unmarshalling/decoding the data.
>>>>> 
>>>>> Only 1) would be impacted by the number of columns, but if you’re
>>> reading everything ideally this would not be a problem.
>>>> 
>>>> The problem is more nuanced than that. Parquet's metadata is somewhat
>>>> "heavy" at the column level. So when you're writing thousands of
>>>> columns, the fixed overhead associated with reading a single column
>>>> becomes problematic. There are several data structures associated with
>>>> decoding a column that have a fixed setup and teardown cost. Even if there
>>>> is 1 millisecond of fixed overhead related to reading a column (I
>>>> don't know what the number is exactly) then reading 10,000 columns has
>>>> 10 seconds of unavoidable overhead. It might be useful for us to
>>>> quantify and communicate the expected overhead when metadata and
>>>> decoding is taken into account. Simply put having more than 1000
>>>> columns is not advisable.
>>>> 
>>>>> Based on an initial cursory look at the Parquet format I guess the
>>> index and the column meta-data might need to be read in full so I can see
>>> how that might slow down reading only a few columns out of a large set. But
>>> that was not really the case here?
>>>>> 
>>>>> What would you suggest for looking into the date index slow-down?
>>>> 
>>>> Can you show a code example to make things easier for us to see what
>>>> you're seeing?
>>>> 
>>>>> 
>>>>> Cheers,
>>>>> Maarten.
>>>>> 
>>>>> 
>>>>> 
>>>>>> On Sep 23, 2019, at 7:07 PM, Wes McKinney <we...@gmail.com>
>>> wrote:
>>>>>> 
>>>>>> hi Maarten,
>>>>>> 
>>>>>> Are you using the master branch or 0.14.1? There are a number of
>>>>>> performance regressions in 0.14.0/0.14.1 that are addressed in the
>>>>>> master branch, to appear as 0.15.0 relatively soon.
>>>>>> 
>>>>>> As a file format, Parquet (and columnar formats in general) is not
>>>>>> known to perform well with more than 1000 columns.
>>>>>> 
>>>>>> On the other items, we'd be happy to work with you to dig through the
>>>>>> performance issues you're seeing.
>>>>>> 
>>>>>> Thanks
>>>>>> Wes
>>>>>> 
>>>>>> On Mon, Sep 23, 2019 at 5:52 PM Maarten Ballintijn <
>>> maartenb@xs4all.nl> wrote:
>>>>>>> 
>>>>>>> Greetings,
>>>>>>> 
>>>>>>> We have Pandas DataFrames with typically about 6,000 rows using
>>> DateTimeIndex.
>>>>>>> They have about 20,000 columns with integer column labels, and data
>>> with a dtype of float32.
>>>>>>> 
>>>>>>> We’d like to store these dataframes with parquet, using the ability
>>> to read a subset of columns and to store meta-data with the file.
>>>>>>> 
>>>>>>> We’ve found the reading performance less than expected compared to
>>> the published benchmarks (e.g. Wes’ blog post).
>>>>>>> 
>>>>>>> Using a modified version of his script we did reproduce his results
>>> (~ 1GB/s for high entropy, no dict on MacBook pro)
>>>>>>> 
>>>>>>> But there seem to be three factors that contribute to the slowdown
>>> for our datasets:
>>>>>>> 
>>>>>>> - DateTimeIndex is much slower than an Int index (we see about a
>>> factor 5).
>>>>>>> - The number of columns impacts reading speed significantly (factor
>>> ~2 going from 16 to 16,000 columns)
>>>>>>> - The ‘use_pandas_metadata=True’ slows down reading significantly
>>> and appears unnecessary? (about 40%)
>>>>>>> 
>>>>>>> Are there ways we could speedup the reading? Should we use a
>>> different layout?
>>>>>>> 
>>>>>>> Thanks for your help and insights!
>>>>>>> 
>>>>>>> Cheers,
>>>>>>> Maarten
>>>>>>> 
>>>>>>> 
>>>>>>> ps. the routines we used:
>>>>>>> 
>>>>>>> def write_arrow_parquet(df: pd.DataFrame, fname: str) -> None:
>>>>>>>  table = pa.Table.from_pandas(df)
>>>>>>>  pq.write_table(table, fname, use_dictionary=False,
>>> compression=None)
>>>>>>>  return
>>>>>>> 
>>>>>>> def read_arrow_parquet(fname: str) -> pd.DataFrame:
>>>>>>>  table = pq.read_table(fname, use_pandas_metadata=False,
>>> use_threads=True)
>>>>>>>  df = table.to_pandas()
>>>>>>>  return df
>>> 
>>> 


Re: Parquet file reading performance

Posted by Joris Van den Bossche <jo...@gmail.com>.
From looking a little bit further into this, it seems that it is mainly
pandas that is slower in creating a Series from an array of datetime64
compared to an array of ints.
And especially if it is not nanosecond resolution:

In [29]: a_int = pa.array(np.arange(100000))

In [30]: %timeit a_int.to_pandas()
56.7 µs ± 299 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [31]: a_datetime = pa.array(pd.date_range("2012", periods=100000,
freq='S'))

In [32]: %timeit a_datetime.to_pandas()
1.94 ms ± 17.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [33]: a_datetime_us = pa.array(pd.date_range("2012", periods=100000,
freq='S'), pa.timestamp('us'))

In [34]: %timeit a_datetime_us.to_pandas()
7.78 ms ± 46.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Creating the datetime64 array inside pyarrow is also a bit slower compared
to int (causing the slower conversion of a_datetime), but the above
difference between nanosecond and microsecond resolution is largely
due to pandas, not pyarrow (because pandas needs to convert the
microseconds to nanoseconds, and during that conversion will also check
that no datetimes were out of bounds for this resolution).
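
You can see the same effect without pyarrow in the picture at all; a minimal sketch
(the sizes are arbitrary, time the two constructors with %timeit):

import numpy as np
import pandas as pd

values_ns = np.arange(100_000).astype("M8[ns]")
values_us = values_ns.astype("M8[us]")

# pandas stores datetimes as datetime64[ns], so the microsecond array has to be
# cast and bounds-checked on construction, while the nanosecond array does not.
pd.Series(values_ns)
pd.Series(values_us)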

And in parquet, the datetime data of the index column will be stored in
microsecond resolution (even if the original pandas data was nanosecond
resolution). And the slower reading of the parquet file with datetime index
is thus almost entirely due to the above difference in timing of converting
the int or datetime index column to pandas.
Parquet nowadays actually supports storing nanosecond resolution, and this
can be triggered in pyarrow by passing version="2.0" to pq.write_table (but
the last I heard, this version is not yet considered production ready).
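
Something along these lines should keep the index in nanoseconds end to end (just a
sketch of the idea; the file name is made up, and as said above, treat version="2.0"
with the appropriate caution):

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({"x": range(3)},
                  index=pd.date_range("2019-01-01", periods=3, freq="S"))
table = pa.Table.from_pandas(df)
# Parquet format version "2.0" can store nanosecond timestamps, so the
# datetime index round-trips as ns and the us -> ns cast on read goes away.
pq.write_table(table, "testdata.dt.v2.parquet",
               use_dictionary=False, compression=None, version="2.0")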

Joris

On Wed, 25 Sep 2019 at 16:03, Joris Van den Bossche <
jorisvandenbossche@gmail.com> wrote:

> Hi Maarten,
>
> Thanks for the reproducible script. I ran it on my laptop on pyarrow
> master, and not seeing the difference between both datetime indexes:
>
> Versions:
> Python:   3.7.3 | packaged by conda-forge | (default, Mar 27 2019,
> 23:01:00)
> [GCC 7.3.0] on linux
> numpy:    1.16.4
> pandas:   0.26.0.dev0+447.gc168ecf26
> pyarrow:  0.14.1.dev642+g7f2d637db
>
> 1073741824 float64 8388608 16
> 0: make_dataframe                 :   1443.483 msec,  709 MB/s
> 0: write_arrow_parquet            :   7685.426 msec,  133 MB/s
> 0: read_arrow_parquet             :   1262.741 msec,  811 MB/s <<<
> 1: make_dataframe                 :   1412.575 msec,  725 MB/s
> 1: write_arrow_parquet            :   7869.145 msec,  130 MB/s
> 1: read_arrow_parquet             :   1947.896 msec,  526 MB/s <<<
> 2: make_dataframe                 :   1490.165 msec,  687 MB/s
> 2: write_arrow_parquet            :   7040.507 msec,  145 MB/s
> 2: read_arrow_parquet             :   1888.316 msec,  542 MB/s <<<
>
> The only change I needed to make in the script to get it running (within
> my memory limits) was the creation of the second DatetimeIndex
> (pd.date_range('1970-01-01', '2019-09-01', freq='S') creates an index of
> 1.5 billion elements, while only the last part of it is used. So changed
> that to index = pd.date_range('2018-01-01', '2019-09-01',
> freq='S').array[-rows:])
>
> The datetime index reading in general is still slower than the int index.
> But doing a bit more detailed timings, it seems this is not due to the
> reading of parquet, but the conversion of arrow to pandas (using the files
> from the benchmark):
>
> In [1]: import pyarrow.parquet as pq
>
> In [4]: %timeit pq.read_table('testdata.int.parquet')
> 41.5 ms ± 3.31 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>
> In [5]: %timeit pq.read_table('testdata.dt.parquet')
> 43 ms ± 1.75 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>
> In [6]: table_int = pq.read_table('testdata.int.parquet')
>
> In [7]: table_datetime = pq.read_table('testdata.dt.parquet')
>
> In [8]: %timeit table_int.to_pandas()
> 14.3 ms ± 309 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>
> In [9]: %timeit table_datetime.to_pandas()
> 47.2 ms ± 2.53 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>
> So you can see that the parquet reading part is basically identical, but
> the conversion to pandas is much slower for the datetime-index case.
> I will try to look into that code path to see what makes this so much
> slower.
>
> Joris
>
>
> On Tue, 24 Sep 2019 at 22:28, Maarten Ballintijn <ma...@xs4all.nl>
> wrote:
>
>> Hi,
>>
>> The code to show the performance issue with DateTimeIndex is at:
>>
>>         https://gist.github.com/maartenb/256556bcd6d7c7636d400f3b464db18c
>>
>> It shows three cases: 0) int index, 1) datetime index, 2) datetime index
>> created in a slightly roundabout way
>>
>> I’m a little confused by the two date time cases. Case 2) is much slower
>> but the df compares identical to case 1)
>> (I originally used something like 2) to match our specific data. I don’t
>> see why it behaves differently??)
>>
>> The timings I find are:
>>
>> 1073741824 float64 8388608 16
>> 0: make_dataframe                 :   2390.830 msec,  428 MB/s
>> 0: write_arrow_parquet            :   2486.463 msec,  412 MB/s
>> 0: read_arrow_parquet             :    813.946 msec,  1258 MB/s <<<
>> 1: make_dataframe                 :   2579.815 msec,  397 MB/s
>> 1: write_arrow_parquet            :   2708.151 msec,  378 MB/s
>> 1: read_arrow_parquet             :   1413.999 msec,  724 MB/s <<<
>> 2: make_dataframe                 :  15126.520 msec,  68 MB/s
>> 2: write_arrow_parquet            :   9205.815 msec,  111 MB/s
>> 2: read_arrow_parquet             :   5929.346 msec,  173 MB/s <<<
>>
>> Case 0, int index.  This is all great.
>> Case 1, date time index. We lose almost half the speed. Given that a
>> datetime is only scaled from Pandas IIRC that seems like a lot?
>> Case 2, other datetime index. No idea what is going on.
>>
>> Any insights are much appreciated.
>>
>> Cheers,
>> Maarten.
>>
>> > On Sep 24, 2019, at 11:25 AM, Wes McKinney <we...@gmail.com> wrote:
>> >
>> > hi
>> >
>> > On Tue, Sep 24, 2019 at 9:26 AM Maarten Ballintijn <maartenb@xs4all.nl
>> <ma...@xs4all.nl>> wrote:
>> >>
>> >> Hi Wes,
>> >>
>> >> Thanks for your quick response.
>> >>
>> >> Yes, we’re using Python 3.7.4, from miniconda and conda-forge, and:
>> >>
>> >> numpy:           1.16.5
>> >> pandas:          0.25.1
>> >> pyarrow:         0.14.1
>> >>
>> >> It looks like 0.15 is close, so I can wait for that.
>> >>
>> >> Theoretically I see three components driving the performance:
>> >> 1) The cost of locating the column (directory overhead)
>> >> 2) The overhead of reading a single column. (reading and processing
>> meta data, setting up for reading)
>> >> 3) Bulk reading and unmarshalling/decoding the data.
>> >>
>> >> Only 1) would be impacted by the number of columns, but if you’re
>> reading everything ideally this would not be a problem.
>> >
>> > The problem is more nuanced than that. Parquet's metadata is somewhat
>> > "heavy" at the column level. So when you're writing thousands of
>> > columns, the fixed overhead associated with reading a single column
>> > becomes problematic. There are several data structures associated with
>> > decoding a column that have a fixed setup and teardown cost. Even if there
>> > is 1 millisecond of fixed overhead related to reading a column (I
>> > don't know what the number is exactly) then reading 10,000 columns has
>> > 10 seconds of unavoidable overhead. It might be useful for us to
>> > quantify and communicate the expected overhead when metadata and
>> > decoding is taken into account. Simply put having more than 1000
>> > columns is not advisable.
>> >
>> >> Based on an initial cursory look at the Parquet format I guess the
>> index and the column meta-data might need to be read in full so I can see
>> how that might slow down reading only a few columns out of a large set. But
>> that was not really the case here?
>> >>
>> >> What would you suggest for looking into the date index slow-down?
>> >
>> > Can you show a code example to make things easier for us to see what
>> > you're seeing?
>> >
>> >>
>> >> Cheers,
>> >> Maarten.
>> >>
>> >>
>> >>
>> >>> On Sep 23, 2019, at 7:07 PM, Wes McKinney <we...@gmail.com>
>> wrote:
>> >>>
>> >>> hi Maarten,
>> >>>
>> >>> Are you using the master branch or 0.14.1? There are a number of
>> >>> performance regressions in 0.14.0/0.14.1 that are addressed in the
>> >>> master branch, to appear as 0.15.0 relatively soon.
>> >>>
>> >>> As a file format, Parquet (and columnar formats in general) is not
>> >>> known to perform well with more than 1000 columns.
>> >>>
>> >>> On the other items, we'd be happy to work with you to dig through the
>> >>> performance issues you're seeing.
>> >>>
>> >>> Thanks
>> >>> Wes
>> >>>
>> >>> On Mon, Sep 23, 2019 at 5:52 PM Maarten Ballintijn <
>> maartenb@xs4all.nl> wrote:
>> >>>>
>> >>>> Greetings,
>> >>>>
>> >>>> We have Pandas DataFrames with typically about 6,000 rows using
>> DateTimeIndex.
>> >>>> They have about 20,000 columns with integer column labels, and data
>> with a dtype of float32.
>> >>>>
>> >>>> We’d like to store these dataframes with parquet, using the ability
>> to read a subset of columns and to store meta-data with the file.
>> >>>>
>> >>>> We’ve found the reading performance less than expected compared to
>> the published benchmarks (e.g. Wes’ blog post).
>> >>>>
>> >>>> Using a modified version of his script we did reproduce his results
>> (~ 1GB/s for high entropy, no dict on MacBook pro)
>> >>>>
>> >>>> But there seem to be three factors that contribute to the slowdown
>> for our datasets:
>> >>>>
>> >>>> - DateTimeIndex is much slower than an Int index (we see about a
>> factor 5).
>> >>>> - The number of columns impacts reading speed significantly (factor
>> ~2 going from 16 to 16,000 columns)
>> >>>> - The ‘use_pandas_metadata=True’ slows down reading significantly
>> and appears unnecessary? (about 40%)
>> >>>>
>> >>>> Are there ways we could speedup the reading? Should we use a
>> different layout?
>> >>>>
>> >>>> Thanks for your help and insights!
>> >>>>
>> >>>> Cheers,
>> >>>> Maarten
>> >>>>
>> >>>>
>> >>>> ps. the routines we used:
>> >>>>
>> >>>> def write_arrow_parquet(df: pd.DataFrame, fname: str) -> None:
>> >>>>   table = pa.Table.from_pandas(df)
>> >>>>   pq.write_table(table, fname, use_dictionary=False,
>> compression=None)
>> >>>>   return
>> >>>>
>> >>>> def read_arrow_parquet(fname: str) -> pd.DataFrame:
>> >>>>   table = pq.read_table(fname, use_pandas_metadata=False,
>> use_threads=True)
>> >>>>   df = table.to_pandas()
>> >>>>   return df
>>
>>

Re: Parquet file reading performance

Posted by Joris Van den Bossche <jo...@gmail.com>.
Hi Maarten,

Thanks for the reproducible script. I ran it on my laptop on pyarrow
master, and not seeing the difference between both datetime indexes:

Versions:
Python:   3.7.3 | packaged by conda-forge | (default, Mar 27 2019,
23:01:00)
[GCC 7.3.0] on linux
numpy:    1.16.4
pandas:   0.26.0.dev0+447.gc168ecf26
pyarrow:  0.14.1.dev642+g7f2d637db

1073741824 float64 8388608 16
0: make_dataframe                 :   1443.483 msec,  709 MB/s
0: write_arrow_parquet            :   7685.426 msec,  133 MB/s
0: read_arrow_parquet             :   1262.741 msec,  811 MB/s <<<
1: make_dataframe                 :   1412.575 msec,  725 MB/s
1: write_arrow_parquet            :   7869.145 msec,  130 MB/s
1: read_arrow_parquet             :   1947.896 msec,  526 MB/s <<<
2: make_dataframe                 :   1490.165 msec,  687 MB/s
2: write_arrow_parquet            :   7040.507 msec,  145 MB/s
2: read_arrow_parquet             :   1888.316 msec,  542 MB/s <<<

The only change I needed to make in the script to get it running (within my
memory limits) was the creation of the second DatetimeIndex
(pd.date_range('1970-01-01', '2019-09-01', freq='S') creates an index of
1.5 billion elements, while only the last part of it is used. So changed
that to index = pd.date_range('2018-01-01', '2019-09-01',
freq='S').array[-rows:])

The datetime index reading in general is still slower than the int index. But
doing a bit more detailed timings, it seems this is not due to the
reading of parquet, but the conversion of arrow to pandas (using the files
from the benchmark):

In [1]: import pyarrow.parquet as pq

In [4]: %timeit pq.read_table('testdata.int.parquet')
41.5 ms ± 3.31 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [5]: %timeit pq.read_table('testdata.dt.parquet')
43 ms ± 1.75 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [6]: table_int = pq.read_table('testdata.int.parquet')

In [7]: table_datetime = pq.read_table('testdata.dt.parquet')

In [8]: %timeit table_int.to_pandas()
14.3 ms ± 309 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [9]: %timeit table_datetime.to_pandas()
47.2 ms ± 2.53 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

So you can see that the parquet reading part is basically identical, but
the conversion to pandas is much slower for the datetime-index case.
I will try to look into that code path to see what makes this so much
slower.

Joris


On Tue, 24 Sep 2019 at 22:28, Maarten Ballintijn <ma...@xs4all.nl> wrote:

> Hi,
>
> The code to show the performance issue with DateTimeIndex is at:
>
>         https://gist.github.com/maartenb/256556bcd6d7c7636d400f3b464db18c
>
> It shows three cases: 0) int index, 1) datetime index, 2) datetime index
> created in a slightly roundabout way
>
> I’m a little confused by the two date time cases. Case 2) is much slower
> but the df compares identical to case 1)
> (I originally used something like 2) to match our specific data. I don’t
> see why it behaves differently??)
>
> The timings I find are:
>
> 1073741824 float64 8388608 16
> 0: make_dataframe                 :   2390.830 msec,  428 MB/s
> 0: write_arrow_parquet            :   2486.463 msec,  412 MB/s
> 0: read_arrow_parquet             :    813.946 msec,  1258 MB/s <<<
> 1: make_dataframe                 :   2579.815 msec,  397 MB/s
> 1: write_arrow_parquet            :   2708.151 msec,  378 MB/s
> 1: read_arrow_parquet             :   1413.999 msec,  724 MB/s <<<
> 2: make_dataframe                 :  15126.520 msec,  68 MB/s
> 2: write_arrow_parquet            :   9205.815 msec,  111 MB/s
> 2: read_arrow_parquet             :   5929.346 msec,  173 MB/s <<<
>
> Case 0, int index.  This is all great.
> Case 1, date time index. We lose almost half the speed. Given that a
> datetime is only scaled from Pandas IIRC that seems like a lot?
> Case 2, other datetime index. No idea what is going on.
>
> Any insights are much appreciated.
>
> Cheers,
> Maarten.
>
> > On Sep 24, 2019, at 11:25 AM, Wes McKinney <we...@gmail.com> wrote:
> >
> > hi
> >
> > On Tue, Sep 24, 2019 at 9:26 AM Maarten Ballintijn <maartenb@xs4all.nl
> <ma...@xs4all.nl>> wrote:
> >>
> >> Hi Wes,
> >>
> >> Thanks for your quick response.
> >>
> >> Yes, we’re using Python 3.7.4, from miniconda and conda-forge, and:
> >>
> >> numpy:           1.16.5
> >> pandas:          0.25.1
> >> pyarrow:         0.14.1
> >>
> >> It looks like 0.15 is close, so I can wait for that.
> >>
> >> Theoretically I see three components driving the performance:
> >> 1) The cost of locating the column (directory overhead)
> >> 2) The overhead of reading a single column. (reading and processing
> meta data, setting up for reading)
> >> 3) Bulk reading and unmarshalling/decoding the data.
> >>
> >> Only 1) would be impacted by the number of columns, but if you’re
> reading everything ideally this would not be a problem.
> >
> > The problem is more nuanced than that. Parquet's metadata is somewhat
> > "heavy" at the column level. So when you're writing thousands of
> > columns, the fixed overhead associated with reading a single column
> > becomes problematic. There are several data structures associated with
> > decoding a column that have a fixed setup and teardown cost. Even if there
> > is 1 millisecond of fixed overhead related to reading a column (I
> > don't know what the number is exactly) then reading 10,000 columns has
> > 10 seconds of unavoidable overhead. It might be useful for us to
> > quantify and communicate the expected overhead when metadata and
> > decoding is taken into account. Simply put having more than 1000
> > columns is not advisable.
> >
> >> Based on an initial cursory look at the Parquet format I guess the
> index and the column meta-data might need to be read in full so I can see
> how that might slow down reading only a few columns out of a large set. But
> that was not really the case here?
> >>
> >> What would you suggest for looking into the date index slow-down?
> >
> > Can you show a code example to make things easier for us to see what
> > you're seeing?
> >
> >>
> >> Cheers,
> >> Maarten.
> >>
> >>
> >>
> >>> On Sep 23, 2019, at 7:07 PM, Wes McKinney <we...@gmail.com> wrote:
> >>>
> >>> hi Maarten,
> >>>
> >>> Are you using the master branch or 0.14.1? There are a number of
> >>> performance regressions in 0.14.0/0.14.1 that are addressed in the
> >>> master branch, to appear as 0.15.0 relatively soon.
> >>>
> >>> As a file format, Parquet (and columnar formats in general) is not
> >>> known to perform well with more than 1000 columns.
> >>>
> >>> On the other items, we'd be happy to work with you to dig through the
> >>> performance issues you're seeing.
> >>>
> >>> Thanks
> >>> Wes
> >>>
> >>> On Mon, Sep 23, 2019 at 5:52 PM Maarten Ballintijn <ma...@xs4all.nl>
> wrote:
> >>>>
> >>>> Greetings,
> >>>>
> >>>> We have Pandas DataFrames with typically about 6,000 rows using
> DateTimeIndex.
> >>>> They have about 20,000 columns with integer column labels, and data
> with a dtype of float32.
> >>>>
> >>>> We’d like to store these dataframes with parquet, using the ability
> to read a subset of columns and to store meta-data with the file.
> >>>>
> >>>> We’ve found the reading performance less than expected compared to
> the published benchmarks (e.g. Wes’ blog post).
> >>>>
> >>>> Using a modified version of his script we did reproduce his results
> (~ 1GB/s for high entropy, no dict on MacBook pro)
> >>>>
> >>>> But there seem to be three factors that contribute to the slowdown
> for our datasets:
> >>>>
> >>>> - DateTimeIndex is much slower than an Int index (we see about a
> factor 5).
> >>>> - The number of columns impacts reading speed significantly (factor ~2
> going from 16 to 16,000 columns)
> >>>> - The ‘use_pandas_metadata=True’ slows down reading significantly and
> appears unnecessary? (about 40%)
> >>>>
> >>>> Are there ways we could speedup the reading? Should we use a
> different layout?
> >>>>
> >>>> Thanks for your help and insights!
> >>>>
> >>>> Cheers,
> >>>> Maarten
> >>>>
> >>>>
> >>>> ps. the routines we used:
> >>>>
> >>>> def write_arrow_parquet(df: pd.DataFrame, fname: str) -> None:
> >>>>   table = pa.Table.from_pandas(df)
> >>>>   pq.write_table(table, fname, use_dictionary=False, compression=None)
> >>>>   return
> >>>>
> >>>> def read_arrow_parquet(fname: str) -> pd.DataFrame:
> >>>>   table = pq.read_table(fname, use_pandas_metadata=False,
> use_threads=True)
> >>>>   df = table.to_pandas()
> >>>>   return df
>
>

Re: Parquet file reading performance

Posted by Maarten Ballintijn <ma...@xs4all.nl>.
Hi,

The code to show the performance issue with DateTimeIndex is at:
	
	https://gist.github.com/maartenb/256556bcd6d7c7636d400f3b464db18c

It shows three cases: 0) int index, 1) datetime index, 2) datetime index created in a slightly roundabout way

I’m a little confused by the two date time cases. Case 2) is much slower but the df compares identical to case 1)
(I originally used something like 2) to match our specific data. I don’t see why it behaves differently??)

The timings I find are:

1073741824 float64 8388608 16
0: make_dataframe                 :   2390.830 msec,  428 MB/s 
0: write_arrow_parquet            :   2486.463 msec,  412 MB/s 
0: read_arrow_parquet             :    813.946 msec,  1258 MB/s <<<
1: make_dataframe                 :   2579.815 msec,  397 MB/s 
1: write_arrow_parquet            :   2708.151 msec,  378 MB/s 
1: read_arrow_parquet             :   1413.999 msec,  724 MB/s <<<
2: make_dataframe                 :  15126.520 msec,  68 MB/s 
2: write_arrow_parquet            :   9205.815 msec,  111 MB/s 
2: read_arrow_parquet             :   5929.346 msec,  173 MB/s <<<

Case 0, int index.  This is all great.
Case 1, date time index. We lose almost half the speed. Given that a datetime is only scaled from Pandas IIRC that seems like a lot?
Case 2, other datetime index. No idea what is going on.

Any insights are much appreciated.

Cheers,
Maarten.

> On Sep 24, 2019, at 11:25 AM, Wes McKinney <we...@gmail.com> wrote:
> 
> hi
> 
> On Tue, Sep 24, 2019 at 9:26 AM Maarten Ballintijn <maartenb@xs4all.nl <ma...@xs4all.nl>> wrote:
>> 
>> Hi Wes,
>> 
>> Thanks for your quick response.
>> 
>> Yes, we’re using Python 3.7.4, from miniconda and conda-forge, and:
>> 
>> numpy:           1.16.5
>> pandas:          0.25.1
>> pyarrow:         0.14.1
>> 
>> It looks like 0.15 is close, so I can wait for that.
>> 
>> Theoretically I see three components driving the performance:
>> 1) The cost of locating the column (directory overhead)
>> 2) The overhead of reading a single column. (reading and processing meta data, setting up for reading)
>> 3) Bulk reading and unmarshalling/decoding the data.
>> 
>> Only 1) would be impacted by the number of columns, but if you’re reading everything ideally this would not be a problem.
> 
> The problem is more nuanced than that. Parquet's metadata is somewhat
> "heavy" at the column level. So when you're writing thousands of
> columns, the fixed overhead associated with reading a single column
> becomes problematic. There are several data structures associated with
> decoding a column that have a fixed setup and teardown cost. Even if there
> is 1 millisecond of fixed overhead related to reading a column (I
> don't know what the number is exactly) then reading 10,000 columns has
> 10 seconds of unavoidable overhead. It might be useful for us to
> quantify and communicate the expected overhead when metadata and
> decoding is taken into account. Simply put having more than 1000
> columns is not advisable.
> 
>> Based on an initial cursory look at the Parquet format I guess the index and the column meta-data might need to be read in full so I can see how that might slow down reading only a few columns out of a large set. But that was not really the case here?
>> 
>> What would you suggest for looking into the date index slow-down?
> 
> Can you show a code example to make things easier for us to see what
> you're seeing?
> 
>> 
>> Cheers,
>> Maarten.
>> 
>> 
>> 
>>> On Sep 23, 2019, at 7:07 PM, Wes McKinney <we...@gmail.com> wrote:
>>> 
>>> hi Maarten,
>>> 
>>> Are you using the master branch or 0.14.1? There are a number of
>>> performance regressions in 0.14.0/0.14.1 that are addressed in the
>>> master branch, to appear as 0.15.0 relatively soon.
>>> 
>>> As a file format, Parquet (and columnar formats in general) is not
>>> known to perform well with more than 1000 columns.
>>> 
>>> On the other items, we'd be happy to work with you to dig through the
>>> performance issues you're seeing.
>>> 
>>> Thanks
>>> Wes
>>> 
>>> On Mon, Sep 23, 2019 at 5:52 PM Maarten Ballintijn <ma...@xs4all.nl> wrote:
>>>> 
>>>> Greetings,
>>>> 
>>>> We have Pandas DataFrames with typically about 6,000 rows using DateTimeIndex.
>>>> They have about 20,000 columns with integer column labels, and data with a dtype of float32.
>>>> 
>>>> We’d like to store these dataframes with parquet, using the ability to read a subset of columns and to store meta-data with the file.
>>>> 
>>>> We’ve found the reading performance less than expected compared to the published benchmarks (e.g. Wes’ blog post).
>>>> 
>>>> Using a modified version of his script we did reproduce his results (~ 1GB/s for high entropy, no dict on MacBook pro)
>>>> 
>>>> But there seem to be three factors that contribute to the slowdown for our datasets:
>>>> 
> >>>> - DateTimeIndex is much slower than an Int index (we see about a factor 5).
> >>>> - The number of columns impacts reading speed significantly (factor ~2 going from 16 to 16,000 columns)
>>>> - The ‘use_pandas_metadata=True’ slows down reading significantly and appears unnecessary? (about 40%)
>>>> 
>>>> Are there ways we could speedup the reading? Should we use a different layout?
>>>> 
>>>> Thanks for your help and insights!
>>>> 
>>>> Cheers,
>>>> Maarten
>>>> 
>>>> 
>>>> ps. the routines we used:
>>>> 
>>>> def write_arrow_parquet(df: pd.DataFrame, fname: str) -> None:
>>>>   table = pa.Table.from_pandas(df)
>>>>   pq.write_table(table, fname, use_dictionary=False, compression=None)
>>>>   return
>>>> 
>>>> def read_arrow_parquet(fname: str) -> pd.DataFrame:
>>>>   table = pq.read_table(fname, use_pandas_metadata=False, use_threads=True)
>>>>   df = table.to_pandas()
>>>>   return df


Re: Parquet file reading performance

Posted by Wes McKinney <we...@gmail.com>.
hi

On Tue, Sep 24, 2019 at 9:26 AM Maarten Ballintijn <ma...@xs4all.nl> wrote:
>
> Hi Wes,
>
> Thanks for your quick response.
>
> Yes, we’re using Python 3.7.4, from miniconda and conda-forge, and:
>
> numpy:           1.16.5
> pandas:          0.25.1
> pyarrow:         0.14.1
>
> It looks like 0.15 is close, so I can wait for that.
>
> Theoretically I see three components driving the performance:
> 1) The cost of locating the column (directory overhead)
> 2) The overhead of reading a single column. (reading and processing meta data, setting up for reading)
> 3) Bulk reading and unmarshalling/decoding the data.
>
> Only 1) would be impacted by the number of columns, but if you’re reading everything ideally this would not be a problem.

The problem is more nuanced than that. Parquet's metadata is somewhat
"heavy" at the column level. So when you're writing thousands of
columns, the fixed overhead associated with reading a single column
becomes problematic. There are several data structures associated with
decoding a column that have a fixed setup and teardown cost. Even if there
is 1 millisecond of fixed overhead related to reading a column (I
don't know what the number is exactly) then reading 10,000 columns has
10 seconds of unavoidable overhead. It might be useful for us to
quantify and communicate the expected overhead when metadata and
decoding is taken into account. Simply put having more than 1000
columns is not advisable.
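
A rough way to put a number on that fixed cost for a given file (purely illustrative;
the file name and the stringified integer column names are assumptions carried over
from the earlier benchmark, not something I have run):

import time
import pyarrow.parquet as pq

def read_seconds(fname, columns):
    t0 = time.perf_counter()
    pq.read_table(fname, columns=columns)
    return time.perf_counter() - t0

# Read 1 column, then 101 columns; the slope approximates the fixed
# per-column setup/teardown cost for this particular file.
one = read_seconds("testdata.dt.parquet", ["0"])
many = read_seconds("testdata.dt.parquet", [str(i) for i in range(101)])
print("~%.2f ms per additional column" % ((many - one) / 100 * 1e3))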

> Based on an initial cursory look at the Parquet format I guess the index and the column meta-data might need to be read in full so I can see how that  might slow down reading only a few columns out of a large set. But that was not really the case here?
>
> What would you suggest for looking into the date index slow-down?

Can you show a code example to make things easier for us to see what
you're seeing?

>
> Cheers,
> Maarten.
>
>
>
> > On Sep 23, 2019, at 7:07 PM, Wes McKinney <we...@gmail.com> wrote:
> >
> > hi Maarten,
> >
> > Are you using the master branch or 0.14.1? There are a number of
> > performance regressions in 0.14.0/0.14.1 that are addressed in the
> > master branch, to appear as 0.15.0 relatively soon.
> >
> > As a file format, Parquet (and columnar formats in general) is not
> > known to perform well with more than 1000 columns.
> >
> > On the other items, we'd be happy to work with you to dig through the
> > performance issues you're seeing.
> >
> > Thanks
> > Wes
> >
> > On Mon, Sep 23, 2019 at 5:52 PM Maarten Ballintijn <ma...@xs4all.nl> wrote:
> >>
> >> Greetings,
> >>
> >> We have Pandas DataFrames with typically about 6,000 rows using DateTimeIndex.
> >> They have about 20,000 columns with integer column labels, and data with a dtype of float32.
> >>
> >> We’d like to store these dataframes with parquet, using the ability to read a subset of columns and to store meta-data with the file.
> >>
> >> We’ve found the reading performance less than expected compared to the published benchmarks (e.g. Wes’ blog post).
> >>
> >> Using a modified version of his script we did reproduce his results (~ 1GB/s for high entropy, no dict on MacBook pro)
> >>
> >> But there seem to be three factors that contribute to the slowdown for our datasets:
> >>
> >> - DateTimeIndex is much slower than an Int index (we see about a factor 5).
> >> - The number of columns impacts reading speed significantly (factor ~2 going from 16 to 16,000 columns)
> >> - The ‘use_pandas_metadata=True’ slows down reading significantly and appears unnecessary? (about 40%)
> >>
> >> Are there ways we could speedup the reading? Should we use a different layout?
> >>
> >> Thanks for your help and insights!
> >>
> >> Cheers,
> >> Maarten
> >>
> >>
> >> ps. the routines we used:
> >>
> >> def write_arrow_parquet(df: pd.DataFrame, fname: str) -> None:
> >>    table = pa.Table.from_pandas(df)
> >>    pq.write_table(table, fname, use_dictionary=False, compression=None)
> >>    return
> >>
> >> def read_arrow_parquet(fname: str) -> pd.DataFrame:
> >>    table = pq.read_table(fname, use_pandas_metadata=False, use_threads=True)
> >>    df = table.to_pandas()
> >>    return df
> >>
> >>
>

Re: Parquet file reading performance

Posted by Maarten Ballintijn <ma...@xs4all.nl>.
Hi Wes,

Thanks for your quick response.

Yes, we’re using Python 3.7.4, from miniconda and conda-forge, and:

numpy:           1.16.5
pandas:          0.25.1
pyarrow:         0.14.1

It looks like 0.15 is close, so I can wait for that.

Theoretically I see three components driving the performance:
1) The cost of locating the column (directory overhead)
2) The overhead of reading a single column. (reading and processing meta data, setting up for reading)
3) Bulk reading and unmarshalling/decoding the data.

Only 1) would be impacted by the number of columns, but if you’re reading everything ideally this would not be a problem. 

Based on an initial cursory look at the Parquet format I guess the index and the column meta-data might need to be read in full so I can see how that  might slow down reading only a few columns out of a large set. But that was not really the case here?
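
(For reference, this is the kind of subset read I mean; the column names here are just
the stringified integer labels, which is an assumption about how from_pandas stores them:)

import pyarrow.parquet as pq

# Read only a handful of the ~20,000 columns by name.
wanted = [str(i) for i in (3, 17, 42)]
table = pq.read_table("data.parquet", columns=wanted,
                      use_pandas_metadata=False, use_threads=True)
df = table.to_pandas()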

What would you suggest for looking into the date index slow-down?

Cheers,
Maarten.



> On Sep 23, 2019, at 7:07 PM, Wes McKinney <we...@gmail.com> wrote:
> 
> hi Maarten,
> 
> Are you using the master branch or 0.14.1? There are a number of
> performance regressions in 0.14.0/0.14.1 that are addressed in the
> master branch, to appear as 0.15.0 relatively soon.
> 
> As a file format, Parquet (and columnar formats in general) is not
> known to perform well with more than 1000 columns.
> 
> On the other items, we'd be happy to work with you to dig through the
> performance issues you're seeing.
> 
> Thanks
> Wes
> 
> On Mon, Sep 23, 2019 at 5:52 PM Maarten Ballintijn <ma...@xs4all.nl> wrote:
>> 
>> Greetings,
>> 
>> We have Pandas DataFrames with typically about 6,000 rows using DateTimeIndex.
>> They have about 20,000 columns with integer column labels, and data with a dtype of float32.
>> 
>> We’d like to store these dataframes with parquet, using the ability to read a subset of columns and to store meta-data with the file.
>> 
>> We’ve found the reading performance less than expected compared to the published benchmarks (e.g. Wes’ blog post).
>> 
>> Using a modified version of his script we did reproduce his results (~ 1GB/s for high entropy, no dict on MacBook pro)
>> 
>> But there seem to be three factors that contribute to the slowdown for our datasets:
>> 
>> - DateTimeIndex is much slower than an Int index (we see about a factor 5).
>> - The number of columns impacts reading speed significantly (factor ~2 going from 16 to 16,000 columns)
>> - The ‘use_pandas_metadata=True’ slows down reading significantly and appears unnecessary? (about 40%)
>> 
>> Are there ways we could speedup the reading? Should we use a different layout?
>> 
>> Thanks for your help and insights!
>> 
>> Cheers,
>> Maarten
>> 
>> 
>> ps. the routines we used:
>> 
>> def write_arrow_parquet(df: pd.DataFrame, fname: str) -> None:
>>    table = pa.Table.from_pandas(df)
>>    pq.write_table(table, fname, use_dictionary=False, compression=None)
>>    return
>> 
>> def read_arrow_parquet(fname: str) -> pd.DataFrame:
>>    table = pq.read_table(fname, use_pandas_metadata=False, use_threads=True)
>>    df = table.to_pandas()
>>    return df
>> 
>> 


Re: Parquet file reading performance

Posted by Wes McKinney <we...@gmail.com>.
hi Maarten,

Are you using the master branch or 0.14.1? There are a number of
performance regressions in 0.14.0/0.14.1 that are addressed in the
master branch, to appear as 0.15.0 relatively soon.

As a file format, Parquet (and columnar formats in general) is not
known to perform well with more than 1000 columns.

On the other items, we'd be happy to work with you to dig through the
performance issues you're seeing.

Thanks
Wes

On Mon, Sep 23, 2019 at 5:52 PM Maarten Ballintijn <ma...@xs4all.nl> wrote:
>
> Greetings,
>
> We have Pandas DataFrames with typically about 6,000 rows using DateTimeIndex.
> They have about 20,000 columns with integer column labels, and data with a dtype of float32.
>
> We’d like to store these dataframes with parquet, using the ability to read a subset of columns and to store meta-data with the file.
>
> We’ve found the reading performance less than expected compared to the published benchmarks (e.g. Wes’ blog post).
>
> Using a modified version of his script we did reproduce his results (~ 1GB/s for high entropy, no dict on MacBook pro)
>
> But there seem to be three factors that contribute to the slowdown for our datasets:
>
> - DateTimeIndex is much slower than an Int index (we see about a factor 5).
> - The number of columns impacts reading speed significantly (factor ~2 going from 16 to 16,000 columns)
> - The ‘use_pandas_metadata=True’ slows down reading significantly and appears unnecessary? (about 40%)
>
> Are there ways we could speedup the reading? Should we use a different layout?
>
> Thanks for your help and insights!
>
> Cheers,
> Maarten
>
>
> ps. the routines we used:
>
> def write_arrow_parquet(df: pd.DataFrame, fname: str) -> None:
>     table = pa.Table.from_pandas(df)
>     pq.write_table(table, fname, use_dictionary=False, compression=None)
>     return
>
> def read_arrow_parquet(fname: str) -> pd.DataFrame:
>     table = pq.read_table(fname, use_pandas_metadata=False, use_threads=True)
>     df = table.to_pandas()
>     return df
>
>
