You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@hive.apache.org by "Eric Hanson (JIRA)" <ji...@apache.org> on 2013/06/11 01:50:20 UTC
[jira] [Updated] (HIVE-4701) Optimize filter Column IN (
list-of-constants ) for vectorized execution
[ https://issues.apache.org/jira/browse/HIVE-4701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Eric Hanson updated HIVE-4701:
------------------------------
Description:
OR filters have been optimized to run with vectorized query execution. IN filters of the form "Column IN (list-of-constants)" are a special case of OR. However, IN does not vectorized currently.
E.g.
select ddate, count\(\*\) from factsqlengineam_vec_orc where ddate = "2012-05-19 00:00:00" OR ddate = "2012-05-20 00:00:00" or ddate = "2012-05-21 00:00:00" group by ddate;
takes about 23 seconds of CPU and
select ddate, count\(\*\) from factsqlengineam_vec_orc where ddate IN ("2012-05-19 00:00:00", "2012-05-20 00:00:00", "2012-05-21 00:00:00") group by ddate;
takes about 153 seconds of CPU.
A simple fix may be that for short IN lists (say <= 64 elements) we turn them into OR by manipulating the query tree before planning whether vectorization can be used.
A more complex fix that covers more cases would be to turn longer IN lists into a join so when we eventually support vectorized joins it will be fast.
An intermediate approach might be to implement a special IN filter operator that stores the constant values in a sorted array or high-performance hash table (like Cuckoo hashing).
was:
OR filters have been optimized to run with vectorized query execution. IN filters of the form "Column IN (list-of-constants)" are a special case of OR. However, IN does not vectorized currently.
E.g.
select ddate, count(*) from factsqlengineam_vec_orc where ddate = "2012-05-19 00:00:00" OR ddate = "2012-05-20 00:00:00" or ddate = "2012-05-21 00:00:00" group by ddate;
takes about 23 seconds of CPU and
select ddate, count(*) from factsqlengineam_vec_orc where ddate IN ("2012-05-19 00:00:00", "2012-05-20 00:00:00", "2012-05-21 00:00:00") group by ddate;
takes about 153 seconds of CPU.
A simple fix may be that for short IN lists (say <= 64 elements) we turn them into OR by manipulating the query tree before planning whether vectorization can be used.
A more complex fix that covers more cases would be to turn longer IN lists into a join so when we eventually support vectorized joins it will be fast.
An intermediate approach might be to implement a special IN filter operator that stores the constant values in a sorted array or high-performance hash table (like Cuckoo hashing).
> Optimize filter Column IN ( list-of-constants ) for vectorized execution
> ------------------------------------------------------------------------
>
> Key: HIVE-4701
> URL: https://issues.apache.org/jira/browse/HIVE-4701
> Project: Hive
> Issue Type: Sub-task
> Reporter: Eric Hanson
>
> OR filters have been optimized to run with vectorized query execution. IN filters of the form "Column IN (list-of-constants)" are a special case of OR. However, IN does not vectorized currently.
> E.g.
> select ddate, count\(\*\) from factsqlengineam_vec_orc where ddate = "2012-05-19 00:00:00" OR ddate = "2012-05-20 00:00:00" or ddate = "2012-05-21 00:00:00" group by ddate;
> takes about 23 seconds of CPU and
> select ddate, count\(\*\) from factsqlengineam_vec_orc where ddate IN ("2012-05-19 00:00:00", "2012-05-20 00:00:00", "2012-05-21 00:00:00") group by ddate;
> takes about 153 seconds of CPU.
> A simple fix may be that for short IN lists (say <= 64 elements) we turn them into OR by manipulating the query tree before planning whether vectorization can be used.
> A more complex fix that covers more cases would be to turn longer IN lists into a join so when we eventually support vectorized joins it will be fast.
> An intermediate approach might be to implement a special IN filter operator that stores the constant values in a sorted array or high-performance hash table (like Cuckoo hashing).
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira