Posted to dev@parquet.apache.org by "Julien Le Dem (JIRA)" <ji...@apache.org> on 2016/02/11 18:26:18 UTC

[jira] [Resolved] (PARQUET-505) Column reader: automatically handle large data pages

     [ https://issues.apache.org/jira/browse/PARQUET-505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Le Dem resolved PARQUET-505.
-----------------------------------
       Resolution: Fixed
    Fix Version/s: cpp-0.1

Issue resolved by pull request 44
[https://github.com/apache/parquet-cpp/pull/44]

> Column reader: automatically handle large data pages
> ----------------------------------------------------
>
>                 Key: PARQUET-505
>                 URL: https://issues.apache.org/jira/browse/PARQUET-505
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-cpp
>            Reporter: Wes McKinney
>            Assignee: Deepak Majeti
>             Fix For: cpp-0.1
>
>
> Currently, we only support data pages whose headers are 64K or less (see {{parquet/column/serialized-page.cc}}). Since page headers can be essentially arbitrarily large (in pathological cases) because of the page statistics, if deserializing the page header fails, we should attempt to read a progressively larger amount of file data in an effort to find the end of the page header.
> As part of this (and to make testing easier!), the maximum data page header size should be configurable. We can write test cases by defining appropriate Statistics structs to yield serialized page headers of any desired size.
> On malformed files, we may run past the end of the file; in such cases we should raise a reasonable exception.
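
The retry strategy described in the issue could be sketched roughly as follows. This is a minimal illustration, not the code from pull request 44: `ParseHeader`, `ReadPageHeader`, and the size constants are hypothetical stand-ins, not the actual parquet-cpp API.

```cpp
#include <algorithm>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Start with a 64K read; if the header does not deserialize, double the
// read size until it fits, the configurable cap is hit, or we run off
// the end of the file (malformed input -> exception).
constexpr uint32_t kDefaultPageHeaderSize = 64 * 1024;     // initial attempt
constexpr uint32_t kMaxPageHeaderSize = 16 * 1024 * 1024;  // configurable cap

// Stand-in for Thrift deserialization: in this toy model, a header of
// `header_size` bytes parses only once the buffer fully contains it.
bool ParseHeader(const std::vector<uint8_t>& buf, uint32_t header_size) {
  return buf.size() >= header_size;
}

// Returns the number of bytes that had to be read before the header
// deserialized; throws if the header runs past EOF or the size cap.
uint32_t ReadPageHeader(const std::vector<uint8_t>& file, size_t offset,
                        uint32_t header_size,
                        uint32_t max_header_size = kMaxPageHeaderSize) {
  uint32_t attempt = kDefaultPageHeaderSize;
  const size_t available = file.size() - offset;  // bytes left in the file
  while (true) {
    const size_t to_read = std::min<size_t>(attempt, available);
    std::vector<uint8_t> buf(file.begin() + offset,
                             file.begin() + offset + to_read);
    if (ParseHeader(buf, header_size)) {
      return static_cast<uint32_t>(to_read);
    }
    if (to_read == available || attempt >= max_header_size) {
      throw std::runtime_error(
          "Deserializing page header failed: ran past end of file or cap");
    }
    attempt *= 2;  // progressively larger read
  }
}
```

Doubling the read size keeps the number of retries logarithmic in the header size, while the explicit cap and end-of-file check give the "reasonable exception" the issue asks for on malformed files.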



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)