Posted to dev@parquet.apache.org by Omega Gamage <om...@gmail.com> on 2020/03/24 03:20:57 UTC

Write a parquet file with delta encoding enabled

I was trying to write a parquet file with delta encoding. This page
<https://github.com/apache/parquet-format/blob/master/Encodings.md> states
that Parquet supports three types of delta encoding:

    (DELTA_BINARY_PACKED, DELTA_LENGTH_BYTE_ARRAY, DELTA_BYTE_ARRAY).

Since Spark, PySpark, and PyArrow do not allow us to specify the encoding
method, I was curious how one can write a file with delta encoding enabled.

However, I found on the internet that if I have columns of timestamp type,
Parquet will use delta encoding. So I used the following Scala code to
create a parquet file. But the resulting encoding is not delta.


    import org.apache.spark.sql.functions.col
    import spark.implicits._ // assumes a SparkSession named `spark`, as in spark-shell

    val df = Seq("2018-05-01",
                "2018-05-02",
                "2018-05-03",
                "2018-05-04",
                "2018-05-05",
                "2018-05-06",
                "2018-05-07",
                "2018-05-08",
                "2018-05-09",
                "2018-05-10"
            ).toDF("Id")
    val df2 = df.withColumn("Timestamp", col("Id").cast("timestamp"))
    val df3 = df2.withColumn("Date", col("Id").cast("date"))

    df3.coalesce(1).write.format("parquet").mode("append").save("date_time2")

parquet-tools shows the following information regarding the written parquet
file.

file schema: spark_schema
--------------------------------------------------------------------------------
Id:          OPTIONAL BINARY L:STRING R:0 D:1
Timestamp:   OPTIONAL INT96 R:0 D:1
Date:        OPTIONAL INT32 L:DATE R:0 D:1

row group 1: RC:31 TS:1100 OFFSET:4
--------------------------------------------------------------------------------
Id:          BINARY SNAPPY DO:0 FPO:4 SZ:230/487/2.12 VC:31 ENC:RLE,PLAIN,BIT_PACKED ST:[min: 2018-05-01, max: 2018-05-31, num_nulls: 0]
Timestamp:   INT96 SNAPPY DO:0 FPO:234 SZ:212/436/2.06 VC:31 ENC:RLE,BIT_PACKED,PLAIN_DICTIONARY ST:[num_nulls: 0, min/max not defined]
Date:        INT32 SNAPPY DO:0 FPO:446 SZ:181/177/0.98 VC:31 ENC:RLE,PLAIN,BIT_PACKED ST:[min: 2018-05-01, max: 2018-05-31, num_nulls: 0]

As you can see, no column has used delta encoding.

My questions are:

   1. How can I write a parquet file with delta encoding? (Example code in
      Scala or Python would be great.)

   2. How do I decide which delta encoding (DELTA_BINARY_PACKED,
      DELTA_LENGTH_BYTE_ARRAY, DELTA_BYTE_ARRAY) to use?

Re: Write a parquet file with delta encoding enabled

Posted by Gabor Szadovszky <ga...@cloudera.com.INVALID>.
Hi,

I am not aware of any parquet writer implementation that allows you to set
the exact encoding for your data. I work on parquet-mr only, though
(parquet-mr is the implementation used by Spark).
In parquet-mr the encoding is chosen based on the data type and some
additional parameters, but none of them force delta encoding.
Different encodings are used for V1 and V2 pages. See the implementations
at
Different encodings are used for V1 and V2 pages. See the implementations
at
https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/factory/DefaultV1ValuesWriterFactory.java
 and
https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/factory/DefaultV2ValuesWriterFactory.java
.
To summarize, no delta encodings are used for V1 pages; they are available
for V2 pages only. For V2, DELTA_BYTE_ARRAY is used for the types
BYTE_ARRAY and FIXED_LEN_BYTE_ARRAY, and DELTA_BINARY_PACKED for the types
INT32 and INT64. (Both are wrapped in dictionary encoding by default.)
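For intuition on why DELTA_BINARY_PACKED fits INT32/INT64 data like sorted
timestamps, here is a toy sketch in plain Scala. It only illustrates the
idea of storing consecutive differences; the names (DeltaSketch, deltas)
are made up for illustration and are not the real parquet-mr API or the
real bit-packed layout.

```scala
// Toy illustration only (NOT the real Parquet bit layout): delta encoding
// pays off when neighbouring values are close, because each value is stored
// as a small difference from its predecessor instead of at full width.
object DeltaSketch {
  // Split a sequence into its first value plus the consecutive differences.
  def deltas(values: Seq[Long]): (Long, Seq[Long]) =
    (values.head, values.sliding(2).map { case Seq(a, b) => b - a }.toSeq)
}

// Daily timestamps in epoch seconds: every difference is the same 86400,
// so the deltas need almost no bits to represent.
val days = (0L until 31L).map(d => 1525132800L + d * 86400L)
val (base, ds) = DeltaSketch.deltas(days)
println(ds.distinct.mkString(",")) // a single distinct delta: 86400
```

A dictionary encoding, by contrast, helps when values repeat; sorted
monotone values rarely repeat, which is where the delta family wins.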
So, to use these encodings, you need to switch Spark to write V2 pages. (I
am not sure how to do that and failed to find it in a quick search.)
Please note that V2 pages (and the related encodings) are not supported by
every parquet implementation.
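
For reference, parquet-mr reads its writer version from the
`parquet.writer.version` property (see ParquetOutputFormat). A hedged,
untested sketch of setting it from Spark via the Hadoop configuration
follows; the accepted value string ("v2" vs "PARQUET_2_0") is an assumption
that may vary by parquet-mr version, so verify against your release.

```scala
// Untested sketch: ask parquet-mr for V2 data pages, which is when the
// delta encodings become eligible. The property name comes from
// parquet-mr's ParquetOutputFormat; "v2" may need to be "PARQUET_2_0"
// depending on the parquet-mr version bundled with your Spark.
spark.sparkContext.hadoopConfiguration
  .set("parquet.writer.version", "v2")

// Then write as before and re-inspect the file with parquet-tools to see
// whether the ENC column now lists the delta encodings.
df3.coalesce(1).write.format("parquet").mode("append").save("date_time2")
```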

Cheers,
Gabor

On Tue, Mar 24, 2020 at 7:03 AM Omega Gamage <om...@gmail.com> wrote:

> I was trying to write a parquet file with delta encoding. This page
> <https://github.com/apache/parquet-format/blob/master/Encodings.md>,
> states
> that parquet supports three types of delta encoding:
>
>     (DELTA_BINARY_PACKED, DELTA_LENGTH_BYTE_ARRAY, DELTA_BYTE_ARRAY).
>
> Since spark, pyspark or pyarrow do not allow us to specify the encoding
> method, I was curious how one can write a file with delta encoding enabled?
>
> However, I found on the internet that if I have columns with TimeStamp type
> parquet will use delta encoding. So I used the following code in scala to
> create a parquet file. But encoding is not a delta.
>
>
>     val df = Seq(("2018-05-01"),
>                 ("2018-05-02"),
>                 ("2018-05-03"),
>                 ("2018-05-04"),
>                 ("2018-05-05"),
>                 ("2018-05-06"),
>                 ("2018-05-07"),
>                 ("2018-05-08"),
>                 ("2018-05-09"),
>                 ("2018-05-10")
>             ).toDF("Id")
>     val df2 = df.withColumn("Timestamp", (col("Id").cast("timestamp")))
>     val df3 = df2.withColumn("Date", (col("Id").cast("date")))
>
>
> df3.coalesce(1).write.format("parquet").mode("append").save("date_time2")
>
> parquet-tools shows the following information regarding the written parquet
> file.
>
> file schema: spark_schema
>
> --------------------------------------------------------------------------------Id:
>          OPTIONAL BINARY L:STRING R:0 D:1Timestamp:   OPTIONAL INT96
> R:0 D:1Date:        OPTIONAL INT32 L:DATE R:0 D:1
>
> row group 1: RC:31 TS:1100 OFFSET:4
>
> --------------------------------------------------------------------------------Id:
>           BINARY SNAPPY DO:0 FPO:4 SZ:230/487/2.12 VC:31
> ENC:RLE,PLAIN,BIT_PACKED ST:[min: 2018-05-01, max: 2018-05-31,
> num_nulls: 0]Timestamp:    INT96 SNAPPY DO:0 FPO:234 SZ:212/436/2.06
> VC:31 ENC:RLE,BIT_PACKED,PLAIN_DICTIONARY ST:[num_nulls: 0, min/max
> not defined]Date:         INT32 SNAPPY DO:0 FPO:446 SZ:181/177/0.98
> VC:31 ENC:RLE,PLAIN,BIT_PACKED ST:[min: 2018-05-01, max: 2018-05-31,
> num_nulls: 0]
>
> As you can see, no column has used delta encoding.
>
> My questions are:
>
>    1.
>
>    How can I write a parquet file with delta encoding? (If you can provide
>    an example code in scala or python that would be great.)
>    2.
>
>    How to decide which "delta encoding": (DELTA_BINARY_PACKED,
>    DELTA_LENGTH_BYTE_ARRAY, DELTA_BYTE_ARRAY) to use?
>