Posted to issues-all@impala.apache.org by "ASF subversion and git services (Jira)" <ji...@apache.org> on 2023/03/27 23:02:00 UTC

[jira] [Commented] (IMPALA-11081) Partition key scan optimization may return incorrect results when partition file have more than one block

    [ https://issues.apache.org/jira/browse/IMPALA-11081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17705723#comment-17705723 ] 

ASF subversion and git services commented on IMPALA-11081:
----------------------------------------------------------

Commit 794eb1ba4a6d459379dee91c4274be3f40bd16ac in impala's branch refs/heads/branch-4.1.2 from zhangyifan27
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=794eb1ba4 ]

IMPALA-11081: Fix incorrect results in partition key scan

This patch fixes incorrect results caused by short-circuit partition
key scan in the case where a Parquet/ORC file contains multiple
blocks.

IMPALA-8834 introduced an optimization that generates only one scan
range per file, corresponding to the first block. For
file-metadata-only queries, backends issue only footer ranges for
Parquet/ORC files (see HdfsScanner::IssueFooterRanges()), which leads
to incorrect results if the first block doesn't include the file
footer. This bug is fixed by returning a scan range corresponding to
the last block for Parquet/ORC files, to make sure it contains the
file footer.
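
The block selection described above can be sketched as follows. This is
a minimal illustration, not the code from the patch: the class
PartitionKeyScanSketch, the Block type, and the methods pickBlock and
reachesFileEnd are hypothetical names, and the block sizes are made up.

```java
import java.util.Arrays;
import java.util.List;

class PartitionKeyScanSketch {
  /** Hypothetical stand-in for one file block: byte offset and length in the file. */
  static final class Block {
    final long offset;
    final long length;
    Block(long offset, long length) {
      this.offset = offset;
      this.length = length;
    }
  }

  /**
   * Picks the single block to turn into a scan range for a partition key scan.
   * Footer-based formats (Parquet/ORC) get the last block, because the footer
   * sits at the end of the file; other formats keep using the first block.
   */
  static Block pickBlock(List<Block> blocks, boolean footerFormat) {
    if (blocks.isEmpty()) throw new IllegalArgumentException("file has no blocks");
    return footerFormat ? blocks.get(blocks.size() - 1) : blocks.get(0);
  }

  /** True if the block ends exactly at the end of the file. */
  static boolean reachesFileEnd(Block b, long fileLength) {
    return b.offset + b.length == fileLength;
  }

  public static void main(String[] args) {
    // A 200-byte file split into two blocks (sizes are illustrative only).
    List<Block> blocks = Arrays.asList(new Block(0, 128), new Block(128, 72));
    Block first = pickBlock(blocks, false);
    Block last = pickBlock(blocks, true);
    System.out.println("first block reaches file end: " + reachesFileEnd(first, 200));
    System.out.println("last block reaches file end: " + reachesFileEnd(last, 200));
  }
}
```

With a single-block file both choices are the same block, which is
consistent with the bug only showing up for files with more than one
block.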

Testing:
- Added e2e tests to verify the fix.

Backport Notes:
- Trivial conflicts in HdfsScanNode.java and test_partition_metadata.py

Change-Id: I17331ed6c26a747e0509dcbaf427cd52808943b1
Reviewed-on: http://gerrit.cloudera.org:8080/19471
Reviewed-by: Impala Public Jenkins <im...@cloudera.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>


> Partition key scan optimization may return incorrect results when partition file have more than one block
> ---------------------------------------------------------------------------------------------------------
>
>                 Key: IMPALA-11081
>                 URL: https://issues.apache.org/jira/browse/IMPALA-11081
>             Project: IMPALA
>          Issue Type: Bug
>    Affects Versions: Impala 4.0.0
>            Reporter: carolinchen
>            Assignee: YifanZhang
>            Priority: Critical
>             Fix For: Impala 4.3.0
>
>
>  The change in https://issues.apache.org/jira/browse/IMPALA-8834 generates only one scan range per file for partition key scans, but this may cause wrong results.
> The problem occurs when a file has more than one block:
>  # The planner transforms only the first block into a TScanRange, which does not include the footer.
>  # The backend can't find the split with the footer, so it can neither parse the footer nor do the scan.
> As a result, the partition key scan returns incorrect results.
>  
> See this snippet in HdfsScanNode.java:
>  
> {code:java}
> private Pair<Boolean, Long> transformBlocksToScanRanges(
>     FeFsPartition partition, FileDescriptor fileDesc, 
>     boolean fsHasBlocks, long scanRangeBytesLimit, 
>     Analyzer analyzer) { 
>     for (int i = 0; i < fileDesc.getNumFileBlocks(); ++i) {
>       // Only generate one scan range for partition key scans.      
>       if (isPartitionKeyScan_) break;
>     }
> }{code}
> In the FE, if a file with more than one block goes through a partition key scan, transformBlocksToScanRanges will not include the footer range.
>  
> See this snippet in hdfs-scanner.cc:
>  
> {code:java}
> /// Issue just the footer range for each file. This function is only used
> /// in parquet and orc scanners. We'll then parse the footer and pick out
> /// the columns we want.
> Status HdfsScanner::IssueFooterRanges(HdfsScanNodeBase* scan_node, 
>     const THdfsFileFormat::type& file_type, 
>     const std::vector<HdfsFileDesc*>& files) {
>     // Try to find the split with the footer.    
>     ScanRange* footer_split = FindFooterSplit(files[i]);
> }{code}
> In the BE, if no footer split is found, no range is added for the scan.
>  
>  
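
The backend side of the failure described in the quoted report can be
modelled with a short sketch. It is written in Java for illustration
only; the real code is C++ in hdfs-scanner.cc. The rule used here to
recognise the footer split (a split that ends exactly at the end of the
file) is an assumption, and FooterSplitSketch, Split, findFooterSplit,
and footerRangesIssued are hypothetical names.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class FooterSplitSketch {
  /** Hypothetical stand-in for a scan range assigned to a file. */
  static final class Split {
    final long offset;
    final long length;
    Split(long offset, long length) {
      this.offset = offset;
      this.length = length;
    }
  }

  /**
   * Returns the split that ends exactly at the end of the file, or null if
   * none of the assigned splits does. Assumption: this is how the footer
   * split is recognised.
   */
  static Split findFooterSplit(List<Split> splits, long fileLength) {
    for (Split s : splits) {
      if (s.offset + s.length == fileLength) return s;
    }
    return null;
  }

  /** Number of footer ranges issued for one file: none when no footer split exists. */
  static int footerRangesIssued(List<Split> splits, long fileLength) {
    return findFooterSplit(splits, fileLength) == null ? 0 : 1;
  }

  public static void main(String[] args) {
    long fileLength = 200; // illustrative two-block file: [0, 128) and [128, 200)
    // Before the fix: only the first block is sent to the backend.
    List<Split> firstBlockOnly = Collections.singletonList(new Split(0, 128));
    // After the fix: the last block is sent instead.
    List<Split> lastBlockOnly = Collections.singletonList(new Split(128, 72));
    System.out.println("first block only: " + footerRangesIssued(firstBlockOnly, fileLength));
    System.out.println("last block only: " + footerRangesIssued(lastBlockOnly, fileLength));
    // Sending both blocks would also work, but defeats the single-range optimization.
    List<Split> both = Arrays.asList(new Split(0, 128), new Split(128, 72));
    System.out.println("both blocks: " + footerRangesIssued(both, fileLength));
  }
}
```

In this model the file is silently skipped when only the first block is
sent, which matches the incorrect results reported for partition key
scans.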



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
