Posted to dev@parquet.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2018/07/31 17:12:00 UTC

[jira] [Commented] (PARQUET-1364) Column Indexes: Invalid row indexes for pages starting with nulls

    [ https://issues.apache.org/jira/browse/PARQUET-1364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16564007#comment-16564007 ] 

ASF GitHub Bot commented on PARQUET-1364:
-----------------------------------------

gszadovszky opened a new pull request #507: PARQUET-1364: Invalid row indexes for pages starting with nulls
URL: https://github.com/apache/parquet-mr/pull/507

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


> Column Indexes: Invalid row indexes for pages starting with nulls
> -----------------------------------------------------------------
>
>                 Key: PARQUET-1364
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1364
>             Project: Parquet
>          Issue Type: Sub-task
>            Reporter: Gabor Szadovszky
>            Assignee: Gabor Szadovszky
>            Priority: Major
>              Labels: pull-request-available
>
> The current implementation for managing the row indexes of pages is not reliable. There is logic in [MessageColumnIO|https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/io/MessageColumnIO.java#L153] that caches null values and flushes them just *before* opening a new group. This logic can cause a page to start with these cached nulls, which are not correctly counted in the written rows, so the row indexes are incorrect. This does not cause any issues if all the pages are read continuously, but it is a huge problem for column index based filtering.
> The implementation described above is really complicated, and I would not like to redesign it because of the mentioned issue. It is easier to simply count the {{0}} repetition levels as record boundaries at the column writer level.
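
To illustrate the proposal in the description above, here is a minimal sketch of counting {{0}} repetition levels as record boundaries. It is not the code from the pull request: the class RowIndexSketch and the method firstRowIndexes are hypothetical names, and the sketch assumes that every page begins at a record boundary and that every value written to the column, nulls included, carries a repetition level.

```java
/**
 * Hypothetical helper for illustration only; not part of parquet-mr.
 * Derives the first row index of each page from repetition levels alone.
 */
final class RowIndexSketch {

    private RowIndexSketch() {
    }

    /**
     * @param pages one array per page, holding the repetition level of every
     *              value (nulls included) written to that page, in write order
     * @return the index of the first row of each page within the column chunk
     */
    static long[] firstRowIndexes(int[][] pages) {
        long[] firstRows = new long[pages.length];
        long rowCount = 0; // records seen so far in the column chunk
        for (int p = 0; p < pages.length; p++) {
            // Assumption: the page starts at a record boundary, so its first
            // row is the number of records already written.
            firstRows[p] = rowCount;
            for (int repetitionLevel : pages[p]) {
                // A repetition level of 0 marks the start of a new record,
                // whether the value itself is null or not.
                if (repetitionLevel == 0) {
                    rowCount++;
                }
            }
        }
        return firstRows;
    }
}
```

Because the count depends only on the repetition levels that reach the column, nulls that are cached and flushed late are still counted in the page where they are actually written.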



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)