Posted to issues@impala.apache.org by "Alexander Behm (JIRA)" <ji...@apache.org> on 2017/05/31 23:38:04 UTC

[jira] [Created] (IMPALA-5412) Scan returns wrong partition-column values when scanning multiple partitions pointing to the same filesystem location.

Alexander Behm created IMPALA-5412:
--------------------------------------

             Summary: Scan returns wrong partition-column values when scanning multiple partitions pointing to the same filesystem location.
                 Key: IMPALA-5412
                 URL: https://issues.apache.org/jira/browse/IMPALA-5412
             Project: IMPALA
          Issue Type: Bug
          Components: Backend
    Affects Versions: Impala 2.3.0, Impala 2.5.0, Impala 2.4.0, Impala 2.6.0, Impala 2.7.0, Impala 2.8.0
            Reporter: Alexander Behm
            Assignee: Alexander Behm
            Priority: Critical


A scan against Avro, RCFile or SequenceFile may return wrong partition-column values when scanning multiple partitions pointing to the same filesystem location.

For example, with the following setup a query may return fewer rows than expected or incorrect counts.
{code}
// Table contents
partition_col=1 points to /user/hive/warehouse/shared_dir/000000_0
partition_col=2 points to /user/hive/warehouse/shared_dir/000000_0
// Query may return wrong results
SELECT COUNT(*) FROM t GROUP BY partition_col
{code}

In particular, COMPUTE STATS uses the query above to populate the per-partition row counts, so those stored row counts may be incorrect.

This bug only affects the Avro, RCFile or SequenceFile formats and does not affect Text, Parquet or non-filesystem tables like Kudu.

The problematic code can be found in hdfs-scan-node-base.h:
{code}
  /// Scanner specific per file metadata (e.g. header information) and associated lock.
  boost::mutex metadata_lock_;
  std::map<std::string, void*> per_file_metadata_;
{code}
The problem is that the same file name could belong to multiple partitions, so a scanner may pick up the wrong per-file metadata, which includes the partition values.
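
To illustrate the collision, here is a minimal standalone sketch. It is not Impala code: FileMetadata, ByFileName and ByPartitionAndFileName are hypothetical names, and keying the map on (partition id, file name) is only one possible remedy, assumed here for illustration rather than taken from an actual fix.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <utility>

// Hypothetical stand-in for a scanner's per-file metadata; not Impala's real type.
struct FileMetadata {
  int partition_col;  // partition value captured when the file header was first parsed
};

// Mirrors the shape of per_file_metadata_: the key is the file name only.
class ByFileName {
 public:
  // Returns the cached entry for 'file', inserting one if none exists yet.
  const FileMetadata& GetOrAdd(const std::string& file, int partition_col) {
    // emplace() does not overwrite, so the first partition to register wins.
    return map_.emplace(file, FileMetadata{partition_col}).first->second;
  }

 private:
  std::map<std::string, FileMetadata> map_;
};

// Assumed remedy: the key includes a partition id, so two partitions that
// share a file each get their own entry.
class ByPartitionAndFileName {
 public:
  const FileMetadata& GetOrAdd(int64_t partition_id, const std::string& file,
                               int partition_col) {
    return map_.emplace(std::make_pair(partition_id, file),
                        FileMetadata{partition_col}).first->second;
  }

 private:
  std::map<std::pair<int64_t, std::string>, FileMetadata> map_;
};
```

With the file-name-only key, looking up /user/hive/warehouse/shared_dir/000000_0 for partition_col=2 after it was registered for partition_col=1 hands back the entry holding partition value 1, which matches the wrong-results symptom described above. With the composite key, each partition sees its own value.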




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)