Posted to dev@parquet.apache.org by "ASF GitHub Bot (Jira)" <ji...@apache.org> on 2022/02/15 00:05:00 UTC

[jira] [Updated] (PARQUET-2124) Bad DCHECK For Intermixed Dictionary Encoding

     [ https://issues.apache.org/jira/browse/PARQUET-2124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated PARQUET-2124:
------------------------------------
    Labels: pull-request-available  (was: )

> Bad DCHECK For Intermixed Dictionary Encoding
> ---------------------------------------------
>
>                 Key: PARQUET-2124
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2124
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-cpp
>            Reporter: William Butler
>            Assignee: William Butler
>            Priority: Minor
>              Labels: pull-request-available
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Parquet CPP has a DCHECK that fires when a dictionary-encoded page comes after a non-dictionary-encoded page. This is bad because the DCHECK can be triggered by Parquet files in which a column has a dictionary page, then a non-dictionary-encoded page, then a page of dictionary-encoded values (indices). Fuzzing found such a file. While this could be turned into an exception, I don't see anything in the Parquet specification that prohibits such a sequence of pages.
> This situation has been brought up on the mailing list before (https://lists.apache.org/thread/3bzymmbxvmzj12km7cjz1150ndvy9bos), and it seems that this is valid but nobody is doing it.
> In the PR that added this check (https://github.com/apache/parquet-cpp/pull/73), it was noted that the check is probably not needed.
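
The page sequence described above (dictionary page, plain-encoded data page, dictionary-encoded data page) can be illustrated with a minimal sketch. It is not parquet-cpp's actual reader: the names Page, PageKind, and DecodeColumn are hypothetical, and values are simplified to int32_t. It assumes only that the dictionary stays available for the rest of the column chunk, and it reports malformed input with an exception instead of a debug-only check.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical page model for illustration only; not parquet-cpp's real types.
enum class PageKind { kDictionary, kPlainData, kDictionaryData };

struct Page {
  PageKind kind;
  // Dictionary entries, plain values, or dictionary indices, depending on kind.
  std::vector<int32_t> payload;
};

// Decodes the pages of one column chunk in order. The dictionary is kept
// across plain-encoded pages, so a later dictionary-encoded page still decodes.
std::vector<int32_t> DecodeColumn(const std::vector<Page>& pages) {
  bool have_dictionary = false;
  std::vector<int32_t> dictionary;
  std::vector<int32_t> out;
  for (const Page& page : pages) {
    switch (page.kind) {
      case PageKind::kDictionary:
        dictionary = page.payload;
        have_dictionary = true;
        break;
      case PageKind::kPlainData:
        // Plain values are copied through; the dictionary is left untouched.
        out.insert(out.end(), page.payload.begin(), page.payload.end());
        break;
      case PageKind::kDictionaryData:
        // File contents are untrusted, so bad input raises an exception
        // rather than tripping an assertion.
        if (!have_dictionary) {
          throw std::runtime_error("dictionary-encoded page without a dictionary");
        }
        for (int32_t index : page.payload) {
          if (index < 0 || static_cast<std::size_t>(index) >= dictionary.size()) {
            throw std::runtime_error("dictionary index out of range");
          }
          out.push_back(dictionary[static_cast<std::size_t>(index)]);
        }
        break;
    }
  }
  return out;
}
```

With a dictionary of {10, 20, 30}, a plain page of {7, 8}, and a dictionary-encoded page with indices {2, 0, 1}, this sketch yields 7, 8, 30, 10, 20. A dictionary-encoded page with no preceding dictionary page throws std::runtime_error.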



--
This message was sent by Atlassian Jira
(v8.20.1#820001)