Posted to user@arrow.apache.org by Robin Aly <ro...@nedap.com> on 2019/11/01 09:31:55 UTC
Selection of encoding scheme
Hi,
I have a conceptual question about the selection of encoding schemes for parquet columns. Hopefully I didn’t miss this question in the archive.
If I understand correctly, Arrow implements “all” encoding schemes that Parquet supports. But how are these selected for the data of a given column/dataset? Is this selection data driven (e.g. tested on a small subset)? Can I somehow influence the selection?
Background: I am using Python to store a pandas dataframe with relatively standard IoT data (device_id, timestamp, value).
device_id timestamp value
0 2016-02-18 21:01:27 0.797649
0 2016-02-18 23:01:27 0.485878
0 2016-02-19 01:01:27 0.738183
0 2016-02-19 03:01:27 0.866196
0 2016-02-19 05:01:27 0.731805
... ... ...
9999 2016-04-17 08:49:21 0.794262
9999 2016-04-17 10:49:21 0.659690
9999 2016-04-17 12:49:21 0.885828
9999 2016-04-17 14:49:21 0.000009
9999 2016-04-17 16:49:21 0.805664
I am surprised that pyarrow doesn’t choose the delta / RLE encoding for timestamp, as it increases in fixed deltas per device_id:
row group 0
--------------------------------------------------------------------------------
device_id: INT64 GZIP DO:0 FPO:4 SZ:156990/83620663/532.65 VC:10451833 [more]...
timestamp: INT64 GZIP DO:0 FPO:157081 SZ:54258488/83620743/1.54 VC:10451833 [more]...
value: DOUBLE GZIP DO:0 FPO:54415661 SZ:78769352/83620743/1.06 VC:10451833 [more]...
Any help / pointers are welcome.
Cheers
Robin
Re: Selection of encoding scheme
Posted by Wes McKinney <we...@gmail.com>.
hi Robin,
The only encodings currently supported in C++ via pyarrow are
dictionary encoding and plain encoding. If the dictionary grows too
large, the writer "falls back" to plain encoding. More details here:
http://arrow.apache.org/blog/2019/09/05/faster-strings-cpp-parquet/
There are "V2" encodings that may be more efficient for your data, but
these need some implementation love to be made available. Note that
Parquet V2 files are not considered "production", so if you use these V2
encodings, your files may not be readable everywhere.
- Wes
On Fri, Nov 1, 2019 at 4:32 AM Robin Aly <ro...@nedap.com> wrote: