Posted to dev@parquet.apache.org by "Nandor Kollar (JIRA)" <ji...@apache.org> on 2018/09/26 14:34:00 UTC

[jira] [Updated] (PARQUET-1290) Clarify maximum run lengths for RLE encoding

     [ https://issues.apache.org/jira/browse/PARQUET-1290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nandor Kollar updated PARQUET-1290:
-----------------------------------
    Fix Version/s: format-2.6.0

> Clarify maximum run lengths for RLE encoding
> --------------------------------------------
>
>                 Key: PARQUET-1290
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1290
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-format
>            Reporter: Tim Armstrong
>            Assignee: Tim Armstrong
>            Priority: Major
>             Fix For: format-2.6.0
>
>
> The Parquet spec isn't clear about what the upper bound on run lengths in the RLE encoding is - https://github.com/apache/parquet-format/blob/master/Encodings.md#run-length-encoding--bit-packing-hybrid-rle--3 .
> It sounds like in practice the major implementations don't support run lengths > (2^31 - 1) - see https://lists.apache.org/thread.html/6731a94a98b790ad24a9a5bb4e1bf9bb799d729e948e046efb40014f@%3Cdev.parquet.apache.org%3E
> I propose that we limit {{bit-pack-count}} and {{number of times repeated}} to <= 2^31.
> It seems unlikely that any Parquet files exist with larger run lengths, given that such runs require huge numbers of values per page and the major implementations can't write or read such files without overflowing 32-bit integers. It might be possible if all the columns in a file were extremely compressible, but in practice most implementations will hit page or file size limits before producing a very large run.
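To make the discussion concrete, here is a minimal sketch of the run header described in Encodings.md: a ULEB128 varint whose low bit selects the run type, with the remaining bits holding the repeat count (RLE run) or group count (bit-packed run). The function names and the explicit 2^31 cap are illustrative of the proposal above, not taken from any implementation.

```python
# Sketch of the RLE / bit-packed hybrid run header from Encodings.md.
# A run header is a ULEB128 varint; the low bit selects the run type:
#   LSB 0 -> RLE run,        header >> 1 = number of times repeated
#   LSB 1 -> bit-packed run, header >> 1 = number of 8-value groups

MAX_RUN = 2**31  # illustrative cap per the proposal above

def encode_run_header(count, bit_packed=False):
    """Encode a run header as ULEB128 bytes, rejecting counts over the cap."""
    assert 0 < count <= MAX_RUN, "proposed spec limit: count must be <= 2^31"
    header = (count << 1) | (1 if bit_packed else 0)
    out = bytearray()
    while True:
        byte = header & 0x7F
        header >>= 7
        if header:
            out.append(byte | 0x80)  # continuation bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def decode_run_header(data):
    """Decode a ULEB128 run header; returns (count, is_bit_packed)."""
    header, shift = 0, 0
    for b in data:
        header |= (b & 0x7F) << shift
        shift += 7
        if not (b & 0x80):
            break
    return header >> 1, bool(header & 1)
```

Note that a count of 2^31 shifted left by one already needs 33 bits, which is where a reader holding the raw header in a signed 32-bit integer overflows; that is the failure mode the cap is meant to make explicit.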



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)