Posted to dev@parquet.apache.org by "Xinli Shang (Jira)" <ji...@apache.org> on 2020/11/04 15:05:00 UTC

[jira] [Commented] (PARQUET-1927) ColumnIndex should provide number of records skipped

    [ https://issues.apache.org/jira/browse/PARQUET-1927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17226089#comment-17226089 ] 

Xinli Shang commented on PARQUET-1927:
--------------------------------------

[~gszadovszky], I just realized that RowGroupFilter only applies the stats from ColumnChunkMetaData, not the page-level stats. There is a chance that the ColumnChunkMetaData stats say yes but the page-level stats say no. In that case, readNextFilteredRowGroup() can still skip the block.
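
To make that case concrete, here is a minimal sketch in plain Java. It does not use the parquet-mr API: MinMax, StatsLevelsSketch, and the sample values are hypothetical and only model min/max statistics for a filter of the form "x == 5". The chunk-level range admits the value, but no individual page range does.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical model of min/max statistics; this is NOT parquet-mr's Statistics class.
final class MinMax {
    final int min;
    final int max;

    MinMax(int min, int max) {
        this.min = min;
        this.max = max;
    }

    // True when a value equal to target could lie inside [min, max].
    boolean mightContain(int target) {
        return min <= target && target <= max;
    }
}

final class StatsLevelsSketch {
    // Row-group-level check: only the column chunk's overall min/max is consulted.
    static boolean chunkMightMatch(MinMax chunkStats, int target) {
        return chunkStats.mightContain(target);
    }

    // Page-level check: at least one page's min/max must admit the target.
    static boolean anyPageMightMatch(List<MinMax> pageStats, int target) {
        for (MinMax page : pageStats) {
            if (page.mightContain(target)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // The chunk spans 1..10, but its two pages cover only 1..3 and 7..10.
        MinMax chunk = new MinMax(1, 10);
        List<MinMax> pages = Arrays.asList(new MinMax(1, 3), new MinMax(7, 10));

        System.out.println("chunk stats say maybe: " + chunkMightMatch(chunk, 5));
        System.out.println("page stats say maybe:  " + anyPageMightMatch(pages, 5));
    }
}
```

With these sample values the chunk-level check keeps the row group while the page-level check rules out every page, which is the situation where the whole block ends up with zero matching rows.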

> ColumnIndex should provide number of records skipped 
> -----------------------------------------------------
>
>                 Key: PARQUET-1927
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1927
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-mr
>    Affects Versions: 1.11.0
>            Reporter: Xinli Shang
>            Assignee: Xinli Shang
>            Priority: Major
>             Fix For: 1.12.0
>
>
> While integrating Parquet ColumnIndex, I found that we need Parquet to tell us how many records were skipped due to ColumnIndex filtering. When rowCount is 0, readNextFilteredRowGroup() just advances to the next row group without telling the caller. See the code here: [https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L969]
>  
> Iceberg reads Parquet records with an iterator. Its hasNext() contains the following check:
> valuesRead + skippedValues < totalValues
> See [https://github.com/apache/iceberg/pull/1566/commits/cd70cac279d3f14ba61f0143f9988d4cc9413651#diff-d80c15b3e5376265436aeab8b79d5a92fb629c6b81f58ad10a11b9b9d3bfcffcR115].
> So without knowing the number of skipped values, it is hard to determine whether hasNext() should return true.
>  
> Currently, we can work around this by using a flag. When readNextFilteredRowGroup() returns null, we consider the whole file done. Then hasNext() just returns false.
>  
>  
>  
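
The flag-based workaround described in the issue could look roughly like the sketch below. It is an illustration under stated assumptions, not the Iceberg or parquet-mr code: FilteredRowGroupSource and FlagBasedRecordIterator are hypothetical names, records are modeled as strings, and the only behavior borrowed from the description is that readNextFilteredRowGroup() returns null once no row groups remain.

```java
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical stand-in for the reader: returns the next non-empty filtered
// row group, or null when the file is exhausted. Row groups whose rows were
// all filtered out are skipped silently, so the caller never sees them.
interface FilteredRowGroupSource {
    List<String> readNextFilteredRowGroup();
}

// Hypothetical iterator that relies on a "done" flag instead of comparing
// valuesRead + skippedValues against totalValues.
final class FlagBasedRecordIterator implements Iterator<String> {
    private final FilteredRowGroupSource source;
    private Iterator<String> current = Collections.emptyIterator();
    private boolean done = false;

    FlagBasedRecordIterator(FilteredRowGroupSource source) {
        this.source = source;
    }

    @Override
    public boolean hasNext() {
        // Pull row groups until one yields a record or the source reports the end.
        while (!done && !current.hasNext()) {
            List<String> group = source.readNextFilteredRowGroup();
            if (group == null) {
                done = true; // a null row group means the whole file is finished
            } else {
                current = group.iterator();
            }
        }
        return !done;
    }

    @Override
    public String next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return current.next();
    }
}
```

In this sketch the iterator never needs totalValues at all, which is why the flag sidesteps the missing skipped-record count. The trade-off is that the caller still cannot report how many records the ColumnIndex filter skipped.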



--
This message was sent by Atlassian Jira
(v8.3.4#803005)