Posted to user@arrow.apache.org by Joris Peeters <jo...@gmail.com> on 2021/07/13 16:12:15 UTC

parquet performance for wide tables (many columns)

Hello,

Sending to user@arrow, as that appears to be the best place for parquet
questions at the moment, but feel free to redirect me.

My objective is to store financial data in Parquet files and read it
back fast. The columns represent stocks (~10,000 of them), and each row
is a date (~8,000 rows, i.e. roughly 30 years of daily data). Values are
e.g. settlement prices. I might want to use short row groups of e.g. a
year each, to get quickly at smaller date ranges, or to query a subset
of columns (stocks).

The appeal of parquet is that I could store all of this in one file and
use row groups + column selection for slicing, rather than having a ton
of smaller files. It would also integrate well with various ML tooling.
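
To make the intended access pattern concrete, here is a minimal sketch
with pyarrow (toy sizes; the file name, column names and row-group size
are just illustrative, not my real setup):

```
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

# Toy stand-in for the real data: one row per date, one column per stock.
stocks = ['AAPL', 'MSFT', 'GOOG']
table = pa.Table.from_arrays([np.random.rand(520) for _ in stocks], names=stocks)

# Write with roughly one year of daily rows per row group.
pq.write_table(table, 'prices.pq', row_group_size=260)

# Slice: one row group (a date range) and a subset of stocks (columns).
f = pq.ParquetFile('prices.pq')
subset = f.read_row_group(0, columns=['AAPL', 'MSFT'])
print(subset.num_rows, subset.num_columns)  # 260 2
```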

When doing some basic performance testing with random data, I noticed
that performance for tables with many columns seems fairly poor. I've
attached a little benchmark script; see its output at the bottom.

Stylised conclusions:
- Reading/writing a "tall" (nrows >> ncols) dataframe is *much* more
performant than a "wide" one.
- With the Arrow IPC format (as opposed to parquet), the difference is
much smaller.
- Results are similar on Windows and Linux, and for Arrow's parquet
implementation vs fastparquet.

Is there something pathological about the parquet format that manifests
in this regime, or is it rather that the code has not been optimised for
this case? I'm aware that ncols >> nrows is not ideal, but I was hoping
for less of a cliff.

Happy to dig in, but polling experts first.

Best,
-J

>python benchmark.py
2021-07-13 16:31:54.786 INFO     Writing parquet to C:\Users\jpeeter\AppData\Local\Temp\tmpstgfosrp\example_tall_apq.pq [Arrow]
2021-07-13 16:31:55.123 INFO     Written.
2021-07-13 16:31:55.123 INFO     Writing parquet to C:\Users\jpeeter\AppData\Local\Temp\tmpstgfosrp\example_wide_apq.pq [Arrow]
2021-07-13 16:31:57.155 INFO     Written.
2021-07-13 16:31:57.155 INFO     Writing parquet to C:\Users\jpeeter\AppData\Local\Temp\tmpstgfosrp\example_tall_fpq.pq [FastParquet]
2021-07-13 16:31:57.789 INFO     Written.
2021-07-13 16:31:57.790 INFO     Writing parquet to C:\Users\jpeeter\AppData\Local\Temp\tmpstgfosrp\example_wide_fpq.pq [FastParquet]
2021-07-13 16:32:03.613 INFO     Written.
2021-07-13 16:32:03.613 INFO     Reading parquet from C:\Users\jpeeter\AppData\Local\Temp\tmpstgfosrp\example_tall_apq.pq [Arrow]
2021-07-13 16:32:03.890 INFO     Read.
2021-07-13 16:32:03.899 INFO     Reading parquet from C:\Users\jpeeter\AppData\Local\Temp\tmpstgfosrp\example_wide_apq.pq [Arrow]
2021-07-13 16:32:08.727 INFO     Read.
2021-07-13 16:32:08.737 INFO     Reading parquet from C:\Users\jpeeter\AppData\Local\Temp\tmpstgfosrp\example_tall_apq.pq [FastParquet]
2021-07-13 16:32:08.983 INFO     Read.
2021-07-13 16:32:08.991 INFO     Reading parquet from C:\Users\jpeeter\AppData\Local\Temp\tmpstgfosrp\example_wide_apq.pq [FastParquet]
2021-07-13 16:32:11.580 INFO     Read.
2021-07-13 16:32:11.589 INFO     Writing Arrow to C:\Users\jpeeter\AppData\Local\Temp\tmpstgfosrp\example_wide.arrows
2021-07-13 16:32:13.057 INFO     Arrow written.
2021-07-13 16:32:13.078 INFO     Writing Arrow to C:\Users\jpeeter\AppData\Local\Temp\tmpstgfosrp\example_tall.arrows
2021-07-13 16:32:13.425 INFO     Arrow written.
2021-07-13 16:32:13.434 INFO     Reading Arrow from C:\Users\jpeeter\AppData\Local\Temp\tmpstgfosrp\example_wide.arrows
2021-07-13 16:32:13.620 INFO     Read.
2021-07-13 16:32:13.637 INFO     Reading Arrow from C:\Users\jpeeter\AppData\Local\Temp\tmpstgfosrp\example_tall.arrows
2021-07-13 16:32:13.711 INFO     Read.

Re: parquet performance for wide tables (many columns)

Posted by Joris Peeters <jo...@gmail.com>.
That's awesome. I've compiled master locally and can indeed confirm the
huge performance improvement.
Thanks!
-J

Re: parquet performance for wide tables (many columns)

Posted by Weston Pace <we...@gmail.com>.
My mistake, the attachment was there. I simply did not check for it
very thoroughly.

I played around with this a bit more and realized the difference was
simply that I was running on the latest build and not on 4.0.1. Digging
a bit further, I think you are seeing [1], and there is more discussion
you may be interested in at [2]. The good news is that you should see
much better performance with 5.0.0 (releasing soon). Even on master it
isn't quite the ~15% overhead I was seeing in my own benchmark; with
your script the wide read is still more like 3.5x slower than the tall
one. Rerunning your benchmark script against the latest master I get
the following output:

```
2021-07-14 01:53:38.706 INFO     Writing parquet to /tmp/tmpk8zj9k1u/example_tall_apq.pq [Arrow]
2021-07-14 01:53:38.906 INFO     Written.
2021-07-14 01:53:38.906 INFO     Writing parquet to /tmp/tmpk8zj9k1u/example_wide_apq.pq [Arrow]
2021-07-14 01:53:39.688 INFO     Written.
2021-07-14 01:53:39.688 INFO     Writing parquet to /tmp/tmpk8zj9k1u/example_tall_fpq.pq [FastParquet]
2021-07-14 01:53:40.077 INFO     Written.
2021-07-14 01:53:40.077 INFO     Writing parquet to /tmp/tmpk8zj9k1u/example_wide_fpq.pq [FastParquet]
2021-07-14 01:53:43.376 INFO     Written.
2021-07-14 01:53:43.377 INFO     Reading parquet from /tmp/tmpk8zj9k1u/example_tall_apq.pq [Arrow]
2021-07-14 01:53:43.473 INFO     Read.
2021-07-14 01:53:43.474 INFO     Reading parquet from /tmp/tmpk8zj9k1u/example_wide_apq.pq [Arrow]
2021-07-14 01:53:43.825 INFO     Read.
2021-07-14 01:53:43.826 INFO     Reading parquet from /tmp/tmpk8zj9k1u/example_tall_apq.pq [FastParquet]
2021-07-14 01:53:43.975 INFO     Read.
2021-07-14 01:53:43.975 INFO     Reading parquet from /tmp/tmpk8zj9k1u/example_wide_apq.pq [FastParquet]
2021-07-14 01:53:45.012 INFO     Read.
2021-07-14 01:53:45.013 INFO     Writing Arrow to /tmp/tmpk8zj9k1u/example_wide.arrows
2021-07-14 01:53:45.581 INFO     Arrow written.
2021-07-14 01:53:45.585 INFO     Writing Arrow to /tmp/tmpk8zj9k1u/example_tall.arrows
2021-07-14 01:53:45.675 INFO     Arrow written.
2021-07-14 01:53:45.676 INFO     Reading Arrow from /tmp/tmpk8zj9k1u/example_wide.arrows
2021-07-14 01:53:45.780 INFO     Read.
2021-07-14 01:53:45.783 INFO     Reading Arrow from /tmp/tmpk8zj9k1u/example_tall.arrows
2021-07-14 01:53:45.796 INFO     Read.
```

[1] https://issues.apache.org/jira/browse/ARROW-12736
[2] https://issues.apache.org/jira/browse/ARROW-11469
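
If you want to check which side of the fix you are on, here is a quick
sketch (hedged; the path assumes the wide file from the script further
down the thread):

```
import time
import pyarrow as pa
import pyarrow.parquet as pq

print(pa.__version__)  # 4.0.1 predates the fix; 5.0.0/master includes it

start = time.perf_counter()
pq.read_table('/tmp/wide.pq')
print(f'wide read: {time.perf_counter() - start:.2f}s')
```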

Re: parquet performance for wide tables (many columns)

Posted by Joris Peeters <jo...@gmail.com>.
I added the script as `benchmark.py` in my original post. Maybe it got
filtered somewhere, but
https://lists.apache.org/api/email.lua?attachment=true&id=r2a7b4fe367184aabfb335fe0a5dc1d2a871ed52d51047b8130bf1fb5@%3Cuser.arrow.apache.org%3E&file=133b9b7a2693cd02fd841cee58b2bfab059f22f17d377df0c62c573d5aa09fb3
might be a stable link.

So, interestingly, I am reproducing my own findings with your script. The
tall table takes about 0.8s to load (as you also found), but the wide one
takes 6.4s. I'm surprised you see ~0.8s for the wide read as well.
Throughout both my own benchmarks and yours, reading the wide table was
always significantly slower for me, on different machines, on both Windows
and Linux, and even across different parquet implementations (Arrow's and
fastparquet).

Re: parquet performance for wide tables (many columns)

Posted by Weston Pace <we...@gmail.com>.
The short answer is no, there is nothing "pathological" about parquet;
it should be more or less as well suited to wide tables as Arrow's IPC
format. Both formats require additional metadata when there are more
columns, and compressibility may differ (although .arrows data is often
uncompressed).

Can you provide your test script? I don't get quite the same results.
For my test I created two tables: one 10,000 columns by 8,000 rows, and
one 80,000,000 rows in a single column. There is simply more metadata
when you have 10k columns, and less opportunity for compression. As a
result the file sizes were 611M for the tall and 739M for the wide, so
the wide one requires about 20% more data. Reading times (hot-in-cache)
were ~0.73s for the tall and ~0.84s for the wide, so the wide takes
about 15% more time to read. That seems about right to me.
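
One way to see the metadata overhead directly is to inspect the file
footers (a small sketch, assuming the files written by the script below;
serialized_size is the Thrift footer size in bytes):

```
import pyarrow.parquet as pq

for name in ('/tmp/tall.pq', '/tmp/wide.pq'):
    md = pq.ParquetFile(name).metadata
    print(name, md.num_columns, md.num_row_groups, md.serialized_size)
```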

## Writing script

```
import pyarrow as pa
import pyarrow.parquet as pq
import numpy as np

TALL_ROWS = 80_000_000
TALL_COLS = 1
WIDE_ROWS = 8_000
WIDE_COLS = 10_000

# Same number of float64 values in both tables, arranged differently.
tall_data = np.random.rand(TALL_COLS, TALL_ROWS)
wide_data = np.random.rand(WIDE_COLS, WIDE_ROWS)

tall_table = pa.Table.from_arrays([tall_data[0]], names=["values"])
pq.write_table(tall_table, '/tmp/tall.pq')

wide_names = [f'f{i}' for i in range(WIDE_COLS)]
wide_table = pa.Table.from_arrays(wide_data, names=wide_names)
pq.write_table(wide_table, '/tmp/wide.pq')
```

## Reading script

```
import time
import pyarrow.parquet as pq

# Read each file and report shape and elapsed time
# (run twice for hot-in-cache timings).
for name in ('/tmp/tall.pq', '/tmp/wide.pq'):
    start = time.perf_counter()
    table = pq.read_table(name)
    print(name, table.num_rows, table.num_columns,
          f'{time.perf_counter() - start:.2f}s')
```

Re: parquet performance for wide tables (many columns)

Posted by Martin Percossi <ma...@percossi.com>.
An alternative representation would be a single settlement-price column
plus a stock_id column. Instead of a single row per time step, you would
then have, say, 10K rows per date, one for each stock.

I think this will yield better performance.
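
For illustration, a rough sketch of that reshaping with pandas (hedged;
toy sizes, and the column names are made up):

```
import numpy as np
import pandas as pd

# Wide: one column per stock, one row per date.
wide = pd.DataFrame(np.random.rand(5, 3), columns=['AAPL', 'MSFT', 'GOOG'])
wide.index.name = 'date'

# Long: a single settlement_price column plus a stock_id column.
long_df = wide.reset_index().melt(id_vars='date', var_name='stock_id',
                                  value_name='settlement_price')
print(long_df.head())
```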
