You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@hive.apache.org by Zoltan Haindrich <ki...@rxd.hu> on 2018/05/15 13:07:48 UTC

Review Request 67126: HIVE-19326: stats auto gather: incorrect aggregation during UNION queries (may lead to incorrect results)

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/67126/
-----------------------------------------------------------

Review request for hive, Ashutosh Chauhan and Sergey Shelukhin.


Bugs: HIVE-19326
    https://issues.apache.org/jira/browse/HIVE-19326


Repository: hive-git


Description
-------

in queries like: INSERT ... SELECT ... UNION ALL SELECT ... 
the stats are only collected for the first select

there are 2 issues fixed - which both resulted in the same result:

* statscollectors have overwritten eachothers result; because the filename was only dependent from the resulting table name
* in case tez.merge.files the 2. task have not been set to collect statistics


Diffs
-----

  itests/src/test/resources/testconfiguration.properties 13c08de3c57fdd7fcd4181814bb8e547c699b9f1 
  ql/src/java/org/apache/hadoop/hive/ql/exec/FileSinkOperator.java 01a5b4c9c328cb034a613a1539cea2584e122fb4 
  ql/src/java/org/apache/hadoop/hive/ql/exec/Operator.java 108bb57c4189b720e672eb6f09b1ef05f78448c2 
  ql/src/java/org/apache/hadoop/hive/ql/exec/TableScanOperator.java 07991811f92bc7accae2fde23244f424bdd64c6b 
  ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java 605bb09caba3b60ad8d51e50dace70757ab80188 
  ql/src/java/org/apache/hadoop/hive/ql/stats/StatsCollectionContext.java 5c3328c63e8ca37e284e1dc1cdbee5969e185a80 
  ql/src/java/org/apache/hadoop/hive/ql/stats/fs/FSStatsPublisher.java 902b37f7874dd5b1afaf8c8bb1259c6f0ddf817f 
  ql/src/test/queries/clientpositive/union_fast_stats.q d69bef3ac083d5a06acda9f47e5d2c1cbe2dfb69 
  ql/src/test/queries/clientpositive/union_rowcounts.q PRE-CREATION 
  ql/src/test/queries/clientpositive/union_stats.q 0e91c23fea475ec95a7fa67433707cb290b277a2 
  ql/src/test/results/clientpositive/llap/multiMapJoin1.q.out f8adcd4ba24f2122f7e7e20770e24a71cfb01a7e 
  ql/src/test/results/clientpositive/llap/union_fast_stats.q.out 4ca5f47a850ba62290c0845eb11a8d0a32780526 
  ql/src/test/results/clientpositive/llap/union_rowcounts.q.out PRE-CREATION 
  ql/src/test/results/clientpositive/llap/union_stats.q.out 5a088f40f5fea41d04a5b6cb6c12bf852a22f097 
  ql/src/test/results/clientpositive/union_stats.q.out 8bd3f44b6e276d6636082673084d66bba3b5c0d3 


Diff: https://reviews.apache.org/r/67126/diff/1/


Testing
-------


Thanks,

Zoltan Haindrich


Re: Review Request 67126: HIVE-19326: stats auto gather: incorrect aggregation during UNION queries (may lead to incorrect results)

Posted by Zoltan Haindrich <ki...@rxd.hu>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/67126/#review203111
-----------------------------------------------------------




ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java
Line 1910 (original), 1910 (patched)
<https://reviews.apache.org/r/67126/#comment285189>

    this change fixes the tez.merge.files case; because the only problem in that cases is that the second filesink is not gathering stats
    
    my own opinion: is that gathering statistics has no real overhead (it writes a file)...I think by enabling it here and there it somewhat just adds complexity



ql/src/test/results/clientpositive/llap/union_stats.q.out
Lines 427 (patched)
<https://reviews.apache.org/r/67126/#comment285190>

    FS_7 is present 2 times in this plan
    
    operator ids are reused multiple times in queries like:
    
        from (select * from src union all select * from src)s
        insert overwrite table t1 select *
        insert overwrite table t2 select *;
    
    if I understand correctly actually the file sink id's are reused for in every union branch to do output.
    
    HIVE-19237 should fix this; and probably also remove indexInTezUnion setters/etc


- Zoltan Haindrich


On May 15, 2018, 1:07 p.m., Zoltan Haindrich wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/67126/
> -----------------------------------------------------------
> 
> (Updated May 15, 2018, 1:07 p.m.)
> 
> 
> Review request for hive, Ashutosh Chauhan and Sergey Shelukhin.
> 
> 
> Bugs: HIVE-19326
>     https://issues.apache.org/jira/browse/HIVE-19326
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> in queries like: INSERT ... SELECT ... UNION ALL SELECT ... 
> the stats are only collected for the first select
> 
> there are 2 issues fixed - which both resulted in the same result:
> 
> * statscollectors have overwritten eachothers result; because the filename was only dependent from the resulting table name
> * in case tez.merge.files the 2. task have not been set to collect statistics
> 
> 
> Diffs
> -----
> 
>   itests/src/test/resources/testconfiguration.properties 13c08de3c57fdd7fcd4181814bb8e547c699b9f1 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/FileSinkOperator.java 01a5b4c9c328cb034a613a1539cea2584e122fb4 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/Operator.java 108bb57c4189b720e672eb6f09b1ef05f78448c2 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/TableScanOperator.java 07991811f92bc7accae2fde23244f424bdd64c6b 
>   ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java 605bb09caba3b60ad8d51e50dace70757ab80188 
>   ql/src/java/org/apache/hadoop/hive/ql/stats/StatsCollectionContext.java 5c3328c63e8ca37e284e1dc1cdbee5969e185a80 
>   ql/src/java/org/apache/hadoop/hive/ql/stats/fs/FSStatsPublisher.java 902b37f7874dd5b1afaf8c8bb1259c6f0ddf817f 
>   ql/src/test/queries/clientpositive/union_fast_stats.q d69bef3ac083d5a06acda9f47e5d2c1cbe2dfb69 
>   ql/src/test/queries/clientpositive/union_rowcounts.q PRE-CREATION 
>   ql/src/test/queries/clientpositive/union_stats.q 0e91c23fea475ec95a7fa67433707cb290b277a2 
>   ql/src/test/results/clientpositive/llap/multiMapJoin1.q.out f8adcd4ba24f2122f7e7e20770e24a71cfb01a7e 
>   ql/src/test/results/clientpositive/llap/union_fast_stats.q.out 4ca5f47a850ba62290c0845eb11a8d0a32780526 
>   ql/src/test/results/clientpositive/llap/union_rowcounts.q.out PRE-CREATION 
>   ql/src/test/results/clientpositive/llap/union_stats.q.out 5a088f40f5fea41d04a5b6cb6c12bf852a22f097 
>   ql/src/test/results/clientpositive/union_stats.q.out 8bd3f44b6e276d6636082673084d66bba3b5c0d3 
> 
> 
> Diff: https://reviews.apache.org/r/67126/diff/1/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Zoltan Haindrich
> 
>