Posted to issues@drill.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2017/07/05 23:49:00 UTC
[jira] [Commented] (DRILL-4139) Fix parquet partition pruning for BIT, INTERVAL and DECIMAL types

    [ https://issues.apache.org/jira/browse/DRILL-4139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16075638#comment-16075638 ] 

ASF GitHub Bot commented on DRILL-4139:
---------------------------------------

Github user jinfengni commented on a diff in the pull request:

    https://github.com/apache/drill/pull/805#discussion_r125784066
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java ---
    @@ -1008,8 +1008,24 @@ public void setMax(Object max) {
           return nulls;
         }
     
    -    @Override public boolean hasSingleValue() {
    -      return (max != null && min != null && max.equals(min));
    +    /**
    +     * Checks that the column chunk has single value.
    +     * Returns true if min and max are the same, but not null.
    +     * Returns true if min and max are null and the number of null values
    +     * in the column chunk is greater than 0.
    +     *
    +     * @return true if column has single value
    --- End diff --
    
    My understanding is that hasSingleValue() returns true only if the column metadata shows exactly one value. A null counts as a value distinct from any non-null value.
    
    Therefore, when a column has min != null && max != null && min.equals(max) && nulls != null && nulls > 0, it should return false. However, both the v1 and v3 implementations return true in that case.
    
    That actually leads to wrong query results. A simple reproduction:
    
    ```
    create table dfs.tmp.`t5/a` as select 100 as mykey from cp.`tpch/nation.parquet` union all select col_notexist from cp.`tpch/region.parquet`;
    
    create table dfs.tmp.`t5/b` as select 200 as mykey from cp.`tpch/nation.parquet` union all select col_notexist from cp.`tpch/region.parquet`;
    ```
    
    This produces two files, each containing a single distinct non-null value plus null values. Now query the two files:
    
    ```
    select mykey from dfs.tmp.`t5` where mykey = 100;
    +--------+
    | mykey  |
    +--------+
    | 100    |
    | 100    |
    | 100    |
    | 100    |
    | 100    |
    | 100    |
    | 100    |
    | 100    |
    | 100    |
    | 100    |
    | 100    |
    | 100    |
    | 100    |
    | 100    |
    | 100    |
    | 100    |
    | 100    |
    | 100    |
    | 100    |
    | 100    |
    | 100    |
    | 100    |
    | 100    |
    | 100    |
    | 100    |
    | null   |
    | null   |
    | null   |
    | null   |
    | null   |
    +--------+
    30 rows selected (0.246 seconds)
    
    ```
    Clearly, those 5 null rows should not be returned.
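    
    To make the expected behavior concrete, here is a minimal sketch of the check I have in mind. The class ColumnStatsSketch and its fields are hypothetical stand-ins for illustration, not the actual classes in Metadata.java, and the all-null branch simply follows the javadoc in the diff above.
    
    ```java
    // Hypothetical stand-in for per-column-chunk metadata; not Drill's actual class.
    final class ColumnStatsSketch {
      final Object min;
      final Object max;
      final Long nulls; // null when the null count is unknown
    
      ColumnStatsSketch(Object min, Object max, Long nulls) {
        this.min = min;
        this.max = max;
        this.nulls = nulls;
      }
    
      boolean hasSingleValue() {
        boolean hasNulls = nulls != null && nulls > 0;
        if (min != null && max != null) {
          // Exactly one distinct non-null value, and no nulls mixed in with it.
          return min.equals(max) && !hasNulls;
        }
        // No min/max recorded: per the javadoc in the diff, treat the chunk as
        // holding only nulls when the null count is positive.
        return min == null && max == null && hasNulls;
      }
    }
    ```
    
    With this logic, a file like `t5/a` (min = max = 100, nulls = 5) would not be treated as single-valued, so the filter mykey = 100 would still have to be evaluated against its rows.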
    
    I applied the 3 commits in this PR on top of today's master branch.
    
    ```
    select * from sys.version;
    +------------------+-------------------------------------------+-------------------------------------------------------------------------------+----------------------------+-----------------+----------------------------+
    |     version      |                 commit_id                 |                                commit_message                                 |        commit_time         |   build_email   |         build_time         |
    +------------------+-------------------------------------------+-------------------------------------------------------------------------------+----------------------------+-----------------+----------------------------+
    | 1.11.0-SNAPSHOT  | cad6e4dc950aa4a95ad20515ce5abd9c546d3e5d  | DRILL-4139: Fix loss of scale value for DECIMAL in parquet partition pruning  | 05.07.2017 @ 12:05:25 PDT  | jni@apache.org  | 05.07.2017 @ 12:06:07 PDT  |
    +------------------+-------------------------------------------+-----
    ```


> Fix parquet partition pruning for BIT, INTERVAL and DECIMAL types
> -----------------------------------------------------------------
>
>                 Key: DRILL-4139
>                 URL: https://issues.apache.org/jira/browse/DRILL-4139
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Storage - Parquet
>    Affects Versions: 1.3.0
>         Environment: 4 node cluster on CentOS
>            Reporter: Khurram Faraaz
>            Assignee: Volodymyr Vysotskyi
>
> Exception while trying to prune partition.
> java.lang.UnsupportedOperationException: Unsupported type: BIT
> is seen in drillbit.log after Functional run on 4 node cluster.
> Drill 1.3.0 sys.version => d61bb83a8
> {code}
> 2015-11-27 03:12:19,809 [29a835ec-3c02-0fb6-d3c1-bae276ef7385:foreman] INFO  o.a.d.e.p.l.partition.PruneScanRule - Beginning partition pruning, pruning class: org.apache.drill.exec.planner.logical.partition.ParquetPruneScanRule$2
> 2015-11-27 03:12:19,809 [29a835ec-3c02-0fb6-d3c1-bae276ef7385:foreman] INFO  o.a.d.e.p.l.partition.PruneScanRule - Total elapsed time to build and analyze filter tree: 0 ms
> 2015-11-27 03:12:19,810 [29a835ec-3c02-0fb6-d3c1-bae276ef7385:foreman] WARN  o.a.d.e.p.l.partition.PruneScanRule - Exception while trying to prune partition.
> java.lang.UnsupportedOperationException: Unsupported type: BIT
>         at org.apache.drill.exec.store.parquet.ParquetGroupScan.populatePruningVector(ParquetGroupScan.java:479) ~[drill-java-exec-1.3.0.jar:1.3.0]
>         at org.apache.drill.exec.planner.ParquetPartitionDescriptor.populatePartitionVectors(ParquetPartitionDescriptor.java:96) ~[drill-java-exec-1.3.0.jar:1.3.0]
>         at org.apache.drill.exec.planner.logical.partition.PruneScanRule.doOnMatch(PruneScanRule.java:235) ~[drill-java-exec-1.3.0.jar:1.3.0]
>         at org.apache.drill.exec.planner.logical.partition.ParquetPruneScanRule$2.onMatch(ParquetPruneScanRule.java:87) [drill-java-exec-1.3.0.jar:1.3.0]
>         at org.apache.calcite.plan.volcano.VolcanoRuleCall.onMatch(VolcanoRuleCall.java:228) [calcite-core-1.4.0-drill-r8.jar:1.4.0-drill-r8]
>         at org.apache.calcite.plan.volcano.VolcanoPlanner.findBestExp(VolcanoPlanner.java:808) [calcite-core-1.4.0-drill-r8.jar:1.4.0-drill-r8]
>         at org.apache.calcite.tools.Programs$RuleSetProgram.run(Programs.java:303) [calcite-core-1.4.0-drill-r8.jar:1.4.0-drill-r8]
>         at org.apache.calcite.prepare.PlannerImpl.transform(PlannerImpl.java:303) [calcite-core-1.4.0-drill-r8.jar:1.4.0-drill-r8]
>         at org.apache.drill.exec.planner.sql.handlers.DefaultSqlHandler.logicalPlanningVolcanoAndLopt(DefaultSqlHandler.java:545) [drill-java-exec-1.3.0.jar:1.3.0]
>         at org.apache.drill.exec.planner.sql.handlers.DefaultSqlHandler.convertToDrel(DefaultSqlHandler.java:213) [drill-java-exec-1.3.0.jar:1.3.0]
>         at org.apache.drill.exec.planner.sql.handlers.DefaultSqlHandler.convertToDrel(DefaultSqlHandler.java:248) [drill-java-exec-1.3.0.jar:1.3.0]
>         at org.apache.drill.exec.planner.sql.handlers.DefaultSqlHandler.getPlan(DefaultSqlHandler.java:164) [drill-java-exec-1.3.0.jar:1.3.0]
>         at org.apache.drill.exec.planner.sql.DrillSqlWorker.getPlan(DrillSqlWorker.java:184) [drill-java-exec-1.3.0.jar:1.3.0]
>         at org.apache.drill.exec.work.foreman.Foreman.runSQL(Foreman.java:905) [drill-java-exec-1.3.0.jar:1.3.0]
>         at org.apache.drill.exec.work.foreman.Foreman.run(Foreman.java:244) [drill-java-exec-1.3.0.jar:1.3.0]
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_45]
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_45]
>         at java.lang.Thread.run(Thread.java:744) [na:1.7.0_45]
> {code}


