Posted to dev@parquet.apache.org by "Jorge Leitão (Jira)" <ji...@apache.org> on 2021/04/18 05:34:00 UTC

[jira] [Resolved] (PARQUET-2029) Delta-bitpacked encoding Java implementation seems incorrect

     [ https://issues.apache.org/jira/browse/PARQUET-2029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jorge Leitão resolved PARQUET-2029.
-----------------------------------
    Resolution: Not A Problem

I made a mistake in decoding the ULEB128 block size: it spans two bytes, 128 and 1, which together decode to 128. All good. Sorry for the noise.
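
For reference, the corrected reading of the page buffer quoted below can be sketched roughly as follows. This is only a minimal illustration under my reading of the specification, not the parquet-mr code: the helper names (read_uleb128, zigzag_decode, read_delta_header) are hypothetical, and the sketch parses only the header and the first block's min delta and bit widths, not the bit-packed miniblocks.

```python
def read_uleb128(buf, pos):
    """Decode one unsigned LEB128 integer at buf[pos]; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift  # low 7 bits carry the payload
        if byte & 0x80 == 0:              # high bit clear: last byte of this integer
            return result, pos
        shift += 7


def zigzag_decode(n):
    """Map an unsigned zig-zag integer back to a signed one (0, 1, 2, 3 -> 0, -1, 1, -2)."""
    return (n >> 1) ^ -(n & 1)


def read_delta_header(buf):
    """Parse the DELTA_BINARY_PACKED header plus the first block's min delta and bit widths."""
    pos = 0
    block_size, pos = read_uleb128(buf, pos)
    miniblocks, pos = read_uleb128(buf, pos)
    total_count, pos = read_uleb128(buf, pos)
    raw, pos = read_uleb128(buf, pos)
    first_value = zigzag_decode(raw)
    raw, pos = read_uleb128(buf, pos)
    min_delta = zigzag_decode(raw)
    # One byte of bit width per miniblock follows the min delta.
    bit_widths = list(buf[pos:pos + miniblocks])
    return {
        "block_size": block_size,
        "miniblocks": miniblocks,
        "total_count": total_count,
        "first_value": first_value,
        "min_delta": min_delta,
        "bit_widths": bit_widths,
    }


# The page buffer from the report.
buffer = [128, 1, 4, 5, 2, 2, 0, 0, 0, 0]
header = read_delta_header(buffer)
print(header)
```

Read this way, bytes 0 and 1 together give a block size of 128, followed by 4 miniblocks per block, 5 values in total, a first value of 1, a min delta of 1, and four miniblock bit widths of 0, which is consistent with [1,2,3,4,5].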

> Delta-bitpacked encoding Java implementation seems incorrect
> ------------------------------------------------------------
>
>                 Key: PARQUET-2029
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2029
>             Project: Parquet
>          Issue Type: Bug
>            Reporter: Jorge Leitão
>            Priority: Major
>
> This assumes that spark==3.1.1 uses the Java reference implementation. Consider:
> {code:python}
> import pyspark.sql
> from pyspark.sql.types import *
> spark = (
>     pyspark.sql.SparkSession.builder.config(
>         "spark.sql.parquet.compression.codec",
>         "uncompressed",
>     )
>     .config("spark.hadoop.parquet.writer.version", "v2")
>     .getOrCreate()
> )
> schema = StructType([StructField("label", IntegerType(), False)])
> df = spark.createDataFrame([[1], [2], [3], [4], [5]], schema).coalesce(1)
> df.write.parquet("bla.parquet", mode="overwrite")
> {code}
> This has no def or rep levels, is encoded with DELTA_BINARY_PACKED = 5, and results in the following page buffer:
> {code}
> buffer: [
>             128,
>             1,
>             4,
>             5,
>             2,
>             2,
>             0,
>             0,
>             0,
>             0,
>         ],
> {code}
> Let's use the notation
> {code}
> (encoded <=u> decoded via uleb128)
> (encoded <=z> decoded via uleb128 zig-zag)
> {code}
> Let's decode the above buffer manually according to [the specification|https://github.com/apache/parquet-format/blob/master/Encodings.md#delta-encoding-delta_binary_packed--5].
> {code}
> byte 0: 128 <=u> 128, the <block size in values>
> byte 1: 1 <=u> 1, the <number of miniblocks in a block>
> byte 2: 4 <=u> 4, the <total value count>
> byte 3: 5 <=z> -3, the <first value>
> byte 4: 2 <=z> 1, the <min delta>
> byte 5: 2 <=u> 2, the bit width of the (only) miniblock
> {code}
> I think that this is incorrect: the first value should be 1, not -3; the total count should be 5, not 4; and the bit width of the miniblock should be 0, not 2.
> Note that if byte 2 were removed, everything would be correct:
> {code}
> byte 0: 128 <=u> 128, the <block size in values>
> byte 1: 1 <=u> 1, the <number of miniblocks in a block>
> byte 2: 5 <=u> 5, the <total value count>
> byte 3: 2 <=z> 1, the <first value>
> byte 4: 2 <=z> 1, the <min delta>
> byte 5: 0 <=u> 0, the bit width of the (only) miniblock
> {code}
> which corresponds to the delta-bitpack encoding of [1,2,3,4,5]: 
> * 5 elements
> * first value is 1
> * the min delta is 1
> * the relative differences are zero (bit_width=0) and thus do not require a buffer (writing 0 is also fine)


