You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@drill.apache.org by rahul challapalli <ch...@gmail.com> on 2015/07/11 01:27:16 UTC

Hash Agg vs Streaming Agg for a smaller data set

Hi,

Info about Data : The data is auto partitioned tpch 0.01 data. The second
filter is a non-partitioned column, so in the first case the 'OR' predicate
results in a full-table scan, while in the second case, partition pruning
takes effect.

The first case results in a hash agg and the second case in a streaming
agg. Any idea why?

1. explain plan for select distinct l_modline, l_moddate from
`tpch_multiple_partitions/lineitem_twopart` where l_moddate=date
'1992-01-01' or l_shipdate=date'1992-01-01';
+------+------+
| text | json |
+------+------+
| 00-00    Screen
00-01      Project(l_modline=[$0], l_moddate=[$1])
00-02        Project(l_modline=[$0], l_moddate=[$1])
00-03          HashAgg(group=[{0, 1}])
00-04            Project(l_modline=[$2], l_moddate=[$0])
00-05              SelectionVectorRemover
00-06                Filter(condition=[OR(=($0, 1992-01-01), =($1,
1992-01-01))])
00-07                  Project(l_moddate=[$2], l_shipdate=[$1],
l_modline=[$0])
00-08                    Scan..........

2. explain plan for select distinct l_modline, l_moddate from
`tpch_multiple_partitions/lineitem_twopart` where l_moddate=date
'1992-01-01' and l_shipdate=date'1992-01-01';
+------+------+
| text | json |
+------+------+
| 00-00    Screen
00-01      Project(l_modline=[$0], l_moddate=[$1])
00-02        Project(l_modline=[$0], l_moddate=[$1])
00-03          StreamAgg(group=[{0, 1}])
00-04            Sort(sort0=[$0], sort1=[$1], dir0=[ASC], dir1=[ASC])
00-05              Project(l_modline=[$2], l_moddate=[$0])
00-06                SelectionVectorRemover
00-07                  Filter(condition=[AND(=($0, 1992-01-01), =($1,
1992-01-01))])
00-08                    Project(l_moddate=[$2], l_shipdate=[$1],
l_modline=[$0])
00-09                      Scan.....................

- Rahul

Re: Hash Agg vs Streaming Agg for a smaller data set

Posted by rahul challapalli <ch...@gmail.com>.

That could be the reason, in the first query we are scanning 64000 records
and in the second case just 108 records. Thanks for the replies!

On Fri, Jul 10, 2015 at 4:48 PM, Jinfeng Ni <ji...@gmail.com> wrote:

> I'm not clear which column is the partitioning column. From what you
> described, row count of aggregator in the first case is larger than that in
> the second case, since the former one requires full table scan. Cost-wise,
> hash-agg would make more sense when the input is larger, since
> streaming-agg requires sort, which could be expensive for large dataset.
>
> My guess is the difference of rowcounts in the two cases cause the
> difference in the query plan.
>
> One suggestion. If you want to check query plan, it would make more sense
> to try with reasonably large data.  Drill's costing model is not fully
> calibrated yet;  a small dataset like tpch_0.0.1 might make it hard for the
> cost model to pick the right plan. On the other hand, if the dataset is
> small, two different plans normally would not make a big difference in
> terms of performance. In other words, try to use large dataset if you are
> interested in performance testing / plan verification.
>
>
>
>
>
> On Fri, Jul 10, 2015 at 4:27 PM, rahul challapalli <
> challapallirahul@gmail.com> wrote:
>
> > Hi,
> >
> > Info about Data : The data is auto partitioned tpch 0.01 data. The second
> > filter is a non-partitioned column, so in the first case the 'OR'
> predicate
> > results in a full-table scan, while in the second case, partition pruning
> > takes effect.
> >
> > The first case results in a hash agg and the second case in a streaming
> > agg. Any idea why?
> >
> > 1. explain plan for select distinct l_modline, l_moddate from
> > `tpch_multiple_partitions/lineitem_twopart` where l_moddate=date
> > '1992-01-01' or l_shipdate=date'1992-01-01';
> > +------+------+
> > | text | json |
> > +------+------+
> > | 00-00    Screen
> > 00-01      Project(l_modline=[$0], l_moddate=[$1])
> > 00-02        Project(l_modline=[$0], l_moddate=[$1])
> > 00-03          HashAgg(group=[{0, 1}])
> > 00-04            Project(l_modline=[$2], l_moddate=[$0])
> > 00-05              SelectionVectorRemover
> > 00-06                Filter(condition=[OR(=($0, 1992-01-01), =($1,
> > 1992-01-01))])
> > 00-07                  Project(l_moddate=[$2], l_shipdate=[$1],
> > l_modline=[$0])
> > 00-08                    Scan..........
> >
> > 2. explain plan for select distinct l_modline, l_moddate from
> > `tpch_multiple_partitions/lineitem_twopart` where l_moddate=date
> > '1992-01-01' and l_shipdate=date'1992-01-01';
> > +------+------+
> > | text | json |
> > +------+------+
> > | 00-00    Screen
> > 00-01      Project(l_modline=[$0], l_moddate=[$1])
> > 00-02        Project(l_modline=[$0], l_moddate=[$1])
> > 00-03          StreamAgg(group=[{0, 1}])
> > 00-04            Sort(sort0=[$0], sort1=[$1], dir0=[ASC], dir1=[ASC])
> > 00-05              Project(l_modline=[$2], l_moddate=[$0])
> > 00-06                SelectionVectorRemover
> > 00-07                  Filter(condition=[AND(=($0, 1992-01-01), =($1,
> > 1992-01-01))])
> > 00-08                    Project(l_moddate=[$2], l_shipdate=[$1],
> > l_modline=[$0])
> > 00-09                      Scan.....................
> >
> > - Rahul
> >
>

Re: Hash Agg vs Streaming Agg for a smaller data set

Posted by Jinfeng Ni <ji...@gmail.com>.

I'm not clear which column is the partitioning column. From what you
described, row count of aggregator in the first case is larger than that in
the second case, since the former one requires full table scan. Cost-wise,
hash-agg would make more sense when the input is larger, since
streaming-agg requires sort, which could be expensive for large dataset.

My guess is the difference of rowcounts in the two cases cause the
difference in the query plan.

One suggestion. If you want to check query plan, it would make more sense
to try with reasonably large data.  Drill's costing model is not fully
calibrated yet;  a small dataset like tpch_0.0.1 might make it hard for the
cost model to pick the right plan. On the other hand, if the dataset is
small, two different plans normally would not make a big difference in
terms of performance. In other words, try to use large dataset if you are
interested in performance testing / plan verification.

On Fri, Jul 10, 2015 at 4:27 PM, rahul challapalli <
challapallirahul@gmail.com> wrote:

> Hi,
>
> Info about Data : The data is auto partitioned tpch 0.01 data. The second
> filter is a non-partitioned column, so in the first case the 'OR' predicate
> results in a full-table scan, while in the second case, partition pruning
> takes effect.
>
> The first case results in a hash agg and the second case in a streaming
> agg. Any idea why?
>
> 1. explain plan for select distinct l_modline, l_moddate from
> `tpch_multiple_partitions/lineitem_twopart` where l_moddate=date
> '1992-01-01' or l_shipdate=date'1992-01-01';
> +------+------+
> | text | json |
> +------+------+
> | 00-00    Screen
> 00-01      Project(l_modline=[$0], l_moddate=[$1])
> 00-02        Project(l_modline=[$0], l_moddate=[$1])
> 00-03          HashAgg(group=[{0, 1}])
> 00-04            Project(l_modline=[$2], l_moddate=[$0])
> 00-05              SelectionVectorRemover
> 00-06                Filter(condition=[OR(=($0, 1992-01-01), =($1,
> 1992-01-01))])
> 00-07                  Project(l_moddate=[$2], l_shipdate=[$1],
> l_modline=[$0])
> 00-08                    Scan..........
>
> 2. explain plan for select distinct l_modline, l_moddate from
> `tpch_multiple_partitions/lineitem_twopart` where l_moddate=date
> '1992-01-01' and l_shipdate=date'1992-01-01';
> +------+------+
> | text | json |
> +------+------+
> | 00-00    Screen
> 00-01      Project(l_modline=[$0], l_moddate=[$1])
> 00-02        Project(l_modline=[$0], l_moddate=[$1])
> 00-03          StreamAgg(group=[{0, 1}])
> 00-04            Sort(sort0=[$0], sort1=[$1], dir0=[ASC], dir1=[ASC])
> 00-05              Project(l_modline=[$2], l_moddate=[$0])
> 00-06                SelectionVectorRemover
> 00-07                  Filter(condition=[AND(=($0, 1992-01-01), =($1,
> 1992-01-01))])
> 00-08                    Project(l_moddate=[$2], l_shipdate=[$1],
> l_modline=[$0])
> 00-09                      Scan.....................
>
> - Rahul
>

Re: Hash Agg vs Streaming Agg for a smaller data set

Posted by Steven Phillips <sp...@maprtech.com>.

My guess is that in the second query, the size of the dataset is smaller,
and this causes the cost of sorting to be small enough that it is cheaper
than the HashAgg.

On Fri, Jul 10, 2015 at 4:27 PM, rahul challapalli <
challapallirahul@gmail.com> wrote:

> Hi,
>
> Info about Data : The data is auto partitioned tpch 0.01 data. The second
> filter is a non-partitioned column, so in the first case the 'OR' predicate
> results in a full-table scan, while in the second case, partition pruning
> takes effect.
>
> The first case results in a hash agg and the second case in a streaming
> agg. Any idea why?
>
> 1. explain plan for select distinct l_modline, l_moddate from
> `tpch_multiple_partitions/lineitem_twopart` where l_moddate=date
> '1992-01-01' or l_shipdate=date'1992-01-01';
> +------+------+
> | text | json |
> +------+------+
> | 00-00    Screen
> 00-01      Project(l_modline=[$0], l_moddate=[$1])
> 00-02        Project(l_modline=[$0], l_moddate=[$1])
> 00-03          HashAgg(group=[{0, 1}])
> 00-04            Project(l_modline=[$2], l_moddate=[$0])
> 00-05              SelectionVectorRemover
> 00-06                Filter(condition=[OR(=($0, 1992-01-01), =($1,
> 1992-01-01))])
> 00-07                  Project(l_moddate=[$2], l_shipdate=[$1],
> l_modline=[$0])
> 00-08                    Scan..........
>
> 2. explain plan for select distinct l_modline, l_moddate from
> `tpch_multiple_partitions/lineitem_twopart` where l_moddate=date
> '1992-01-01' and l_shipdate=date'1992-01-01';
> +------+------+
> | text | json |
> +------+------+
> | 00-00    Screen
> 00-01      Project(l_modline=[$0], l_moddate=[$1])
> 00-02        Project(l_modline=[$0], l_moddate=[$1])
> 00-03          StreamAgg(group=[{0, 1}])
> 00-04            Sort(sort0=[$0], sort1=[$1], dir0=[ASC], dir1=[ASC])
> 00-05              Project(l_modline=[$2], l_moddate=[$0])
> 00-06                SelectionVectorRemover
> 00-07                  Filter(condition=[AND(=($0, 1992-01-01), =($1,
> 1992-01-01))])
> 00-08                    Project(l_moddate=[$2], l_shipdate=[$1],
> l_modline=[$0])
> 00-09                      Scan.....................
>
> - Rahul
>



-- 
 Steven Phillips
 Software Engineer

 mapr.com

Re: Hash Agg vs Streaming Agg for a smaller data set

Posted by Steven Phillips <sp...@maprtech.com>.

My guess is that in the second query, the size of the dataset is smaller,
and this causes the cost of sorting to be small enough that it is cheaper
than the HashAgg.

On Fri, Jul 10, 2015 at 4:27 PM, rahul challapalli <
challapallirahul@gmail.com> wrote:

> Hi,
>
> Info about Data : The data is auto partitioned tpch 0.01 data. The second
> filter is a non-partitioned column, so in the first case the 'OR' predicate
> results in a full-table scan, while in the second case, partition pruning
> takes effect.
>
> The first case results in a hash agg and the second case in a streaming
> agg. Any idea why?
>
> 1. explain plan for select distinct l_modline, l_moddate from
> `tpch_multiple_partitions/lineitem_twopart` where l_moddate=date
> '1992-01-01' or l_shipdate=date'1992-01-01';
> +------+------+
> | text | json |
> +------+------+
> | 00-00    Screen
> 00-01      Project(l_modline=[$0], l_moddate=[$1])
> 00-02        Project(l_modline=[$0], l_moddate=[$1])
> 00-03          HashAgg(group=[{0, 1}])
> 00-04            Project(l_modline=[$2], l_moddate=[$0])
> 00-05              SelectionVectorRemover
> 00-06                Filter(condition=[OR(=($0, 1992-01-01), =($1,
> 1992-01-01))])
> 00-07                  Project(l_moddate=[$2], l_shipdate=[$1],
> l_modline=[$0])
> 00-08                    Scan..........
>
> 2. explain plan for select distinct l_modline, l_moddate from
> `tpch_multiple_partitions/lineitem_twopart` where l_moddate=date
> '1992-01-01' and l_shipdate=date'1992-01-01';
> +------+------+
> | text | json |
> +------+------+
> | 00-00    Screen
> 00-01      Project(l_modline=[$0], l_moddate=[$1])
> 00-02        Project(l_modline=[$0], l_moddate=[$1])
> 00-03          StreamAgg(group=[{0, 1}])
> 00-04            Sort(sort0=[$0], sort1=[$1], dir0=[ASC], dir1=[ASC])
> 00-05              Project(l_modline=[$2], l_moddate=[$0])
> 00-06                SelectionVectorRemover
> 00-07                  Filter(condition=[AND(=($0, 1992-01-01), =($1,
> 1992-01-01))])
> 00-08                    Project(l_moddate=[$2], l_shipdate=[$1],
> l_modline=[$0])
> 00-09                      Scan.....................
>
> - Rahul
>



-- 
 Steven Phillips
 Software Engineer

 mapr.com