Posted to user@arrow.apache.org by Robin Aly <ro...@nedap.com> on 2019/11/01 09:31:55 UTC
Selection of encoding scheme
Hi,
I have a conceptual question about the selection of encoding schemes for parquet columns. Hopefully I didn’t miss this question in the archive.
If I understand correctly, Arrow implements “all” encoding schemes that Parquet supports. But how are these selected for the data of a given column/dataset? Is this selection data driven (e.g. tested on a small subset)? Can I somehow influence the selection?
Background: I am using Python to store a pandas dataframe with relatively standard IoT data (device_id, timestamp, value).
device_id timestamp value
0 2016-02-18 21:01:27 0.797649
0 2016-02-18 23:01:27 0.485878
0 2016-02-19 01:01:27 0.738183
0 2016-02-19 03:01:27 0.866196
0 2016-02-19 05:01:27 0.731805
... ... ...
9999 2016-04-17 08:49:21 0.794262
9999 2016-04-17 10:49:21 0.659690
9999 2016-04-17 12:49:21 0.885828
9999 2016-04-17 14:49:21 0.000009
9999 2016-04-17 16:49:21 0.805664
I am surprised that pyarrow doesn’t choose the delta / RLE encoding for timestamp, as it increases in fixed deltas per device_id:
row group 0
--------------------------------------------------------------------------------
device_id: INT64 GZIP DO:0 FPO:4 SZ:156990/83620663/532.65 VC:10451833 [more]...
timestamp: INT64 GZIP DO:0 FPO:157081 SZ:54258488/83620743/1.54 VC:10451833 [more]...
value: DOUBLE GZIP DO:0 FPO:54415661 SZ:78769352/83620743/1.06 VC:10451833 [more]...
Any help / pointers are welcome.
Cheers
Robin
Re: Selection of encoding scheme
Posted by Wes McKinney <we...@gmail.com>.
hi Robin,
The only encodings currently supported in C++ via pyarrow are
dictionary encoding and plain encoding. If the dictionary grows too
large, the writer "falls back" to plain encoding. More details here:
http://arrow.apache.org/blog/2019/09/05/faster-strings-cpp-parquet/
There are "V2" encodings that may be more efficient for your data, but
these need some implementation love to be made available. Note that
Parquet V2 files are not considered "production", so if you use these V2
encodings, your files may not be readable everywhere.
- Wes
On Fri, Nov 1, 2019 at 4:32 AM Robin Aly <ro...@nedap.com> wrote: