You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@doris.apache.org by Lee Happen <Ha...@hotmail.com> on 2020/08/19 05:08:13 UTC

[Proposal] Support Bucket Shuffle Join for Doris

Motivation

At present, Doris support 3 type join: shuffle join, broadcast join, colocate join.
Except colocate join，another join will lead to a lot of network consumption.

For example, there a SQL A join B, the cost of network.

* broadcast join: if table A is divided into three parts，the net work cost is 3B
* shuffle join: the network cost is A + B.

These network consumption not only leads to slow query, but also leads to extra memory consumption during join.

Each Doris table have disrtribute info, if the join expr hit the distribute info, we should use the distribute info to reduce the network consumption.

What is bucket shuffle join

[image.png]<https://camo.githubusercontent.com/ba3ac7db1e7c983ec0f555d332b1064c69a9dc2a/68747470733a2f2f75706c6f61642d696d616765732e6a69616e7368752e696f2f75706c6f61645f696d616765732f383535323230312d633338336665383461656565313362632e706e673f696d6167654d6f6772322f6175746f2d6f7269656e742f7374726970253743696d61676556696577322f322f772f31323430>

just like Hive's bucket map join, the picture show how it work. if there a SQL A join B, and the join expr hit the distribute info of A. Bucket shuffle join only need distribute table B, sent the data to proper table A part. So the network cost is always B.

So compared with the original join, obviously bucket shuffle join lead to less network cost：

B < min(3B, A + B)

It can bring us the following benefits:

1. First, Bucket Shuffle Join reduce the network cost and lead to a better performance for some join. Especially when the bucket is cropped.

2. It does not strongly rely on the mechanism of collocate, so it is transparent to users. There is no mandatory requirement for data distribution, which will not lead to data skew.

3. It can provide more query optimization space for join reorder.

POC of Bucket Shuffle Join

Now I've implemented a simple Bucket Shuffle join in Doris and test the performance of it.

Now, we chose tpcds query 57. The query have 6 join operation, and 4 of them can hit Bucket shuffle join.

Origin Doris Bucket shuffle join
Time Cost 27.7s 16.4s

It seems to work as well as we expected. I'll do more experiments to verify its performance in the future

Implementation

1. First, we should add a partition type in thrift type

2. FE able to plan and sense queries that can be used bucket shuffle join. send data distribution info to BE

3. BE use the proper hash function to send proper data to proper instance of BE.

Best Wish
Happen Lee

Re: [Proposal] Support Bucket Shuffle Join for Doris

Posted by Zhao Chun <zh...@apache.org>.

Hi, Happen

I don't have a problem with single partition optimization. I just want to
remind you of the following points

1. Tpc-ds is generally a benchmark. In principle, partitions should be
created for real use. Otherwise, the amount of data keeps rising, which
will cause the tablet to become very large. While tpc-ds has many large
tables to join, I understand that the effect of colocated should be better.
In addition, when it comes to bucket cutting, we should be able to push
down the filtering conditions. In this way, the data in the right table has
been filtered before shuffle / broadcast. The amount of data distributed
due to bucket clipping should not exist.

2. Because our data is stored in multiple copies. Only after the
coordinator completes can we determine the tablet to be queried by each
instance. This requires modifying the entire distributed information after
the coordinator.

3. In addition to using CRC32 for shaping, other data types may also need
to be consistent.

Thanks,
Zhao Chun


Lee Happen <Ha...@hotmail.com> 于2020年8月19日周三 下午4:46写道：

> Hi Zhao Chun
>
> 1. It can only be applied to a single partition. It is similar to that
> when the table joins itself, it should use collocate join in the single
> partition scenario. The query like TPCDS is not multi partitioned, so the
> effect will be better.
> Doris don't just have partition cutting, but also bucket cutting. With
> bucket clipping in effect, bucket join can reduce the amount of data sent
> by the network.
> For example, table A has 10 buckets and only 2 buckets after trimming.
> Then table B only needs to send 1 / 5 of the data. Compared with shuffle
> join and broadcast join, which need to send much less data.
>
> 2. The amount of distributed partition data will not be very complex. The
> implementation here is consistent with the original. After the list
> instance is sent, it will know which line to send through the list
> instance. If the instance is tagged to empty, the data will not be sent
> which means the bucket is clipped. The size of this list is consistent with
> the num bucket of partition. The only thing to note is to replace the hash
> algorithm in BE with CRC32
>
> 3. I will post POC profile when I can reuse the test env.
>
>
> Best Wish,
> Happen Lee
> ________________________________
> From: Zhao Chun <zh...@apache.org>
> Sent: August 19, 2020 7:47
> To: dev@doris.apache.org <de...@doris.apache.org>
> Subject: Re: [Proposal] Support Bucket Shuffle Join for Doris
>
> Hi Happen
>
> I think we can only support bucket join of single partition table. But I
> have some concerns
>
> 1. In real use, most scenes are created with multiple partitions. The
> possible scenarios for this optimization are limited.
>
> 2. It will make the data stream sender more complex. Because it is
> necessary to transfer the whole partition distribution information from Fe,
> and the hash mode of data import should be coupled when data hash
> calculation.
>
> In addition, can you post the POC profile content? I want to know the
> specific reasons for the promotion.
>
> Thanks,
> Zhao Chun
>
>
> Lee Happen <Ha...@hotmail.com> 于2020年8月19日周三 下午2:39写道：
>
> >
> > Hi Zhao Chun
> >
> > Let me explain the two questions you mentioned
> >
> >
> > 1. Doris have data partition and data distribution meta,  a certain
> bucket
> > seq and partition can be located to a tablet. Each tablet may be
> > distributed on different machines. if we both consider data partition and
> > data distribution. The Network consumption of bucket shuffle join will
> > degenerate. More machine more degradation.
> >
> > So Bucket shuffle join only when a single partition is queried can it
> play
> > its role
> >
> > 2. Before the query, the partition clipping will be carried out by FE. In
> > many query scenarios, queries are limited to a single partition or even a
> > few buckets in a single partition. In other words, the better the effect
> of
> > partition cutting, The better the performance of Bucket shuffle join.
> >
> > So we only need take care of distribution column, but not the data
> > partition column.
> >
> > 3. if we support bucket join for tables with group info. It's not much
> > different from colocate join. Most scenarios colocate join do well thing
> > than Bucket shuffle join. It also brings the problem of data skew.
> >
> > I think Bucket Shuffle Join Is a special scene optimization, it is
> > transparent to users. If there are multi partition in left table, we
> still
> > can use
> > shuffle join, broadcast join like before.
> >
> >
> > Best Wish,
> > Happen Lee
> > ________________________________
> > From: Zhao Chun <zh...@apache.org>
> > Sent: August 19, 2020 5:53
> > To: dev@doris.apache.org <de...@doris.apache.org>
> > Subject: Re: [Proposal] Support Bucket Shuffle Join for Doris
> >
> > Hi Happen
> >
> > Good proposal.
> >
> > If we want to carry out this work, we need to pay attention to several
> > aspects.
> >
> > 1. The data partition and data distribution of Doris are two independent
> > modules. If you want to transfer data according to the data distribution,
> > the metadata information may be high, including all partition information
> > and bucket information.
> >
> > 2. Currently, Doris supports partition first and then distributed.
> Usually,
> > the two are based on different columns. This will cause the same
> > distribution columns to be on the same machine. For example, if table a
> is
> > partitioned according to the time column, and then the bucket is divided
> > according to the userid row, then the same userid in different partitions
> > will appear on different machines. In extreme cases, the same userid may
> > appear on all machines, which may degenerate into broadcast join when
> > joining the userid column.
> >
> > I think we should be able to support bucket join for tables with group
> > info. If we support all tables, there are still many points to consider.
> >
> > Thanks,
> > Zhao Chun
> >
> >
> > Lee Happen <Ha...@hotmail.com> 于2020年8月19日周三 下午1:08写道：
> >
> > >
> > > Motivation
> > >
> > > At present, Doris support 3 type join: shuffle join, broadcast join,
> > > colocate join.
> > > Except colocate join，another join will lead to a lot of network
> > > consumption.
> > >
> > > For example, there a SQL A join B, the cost of network.
> > >
> > >   *   broadcast join: if table A is divided into three parts，the net
> work
> > > cost is 3B
> > >   *   shuffle join: the network cost is A + B.
> > >
> > > These network consumption not only leads to slow query, but also leads
> to
> > > extra memory consumption during join.
> > >
> > > Each Doris table have disrtribute info, if the join expr hit the
> > > distribute info, we should use the distribute info to reduce the
> network
> > > consumption.
> > >
> > > What is bucket shuffle join
> > >
> > > [image.png]<
> > >
> >
> https://camo.githubusercontent.com/ba3ac7db1e7c983ec0f555d332b1064c69a9dc2a/68747470733a2f2f75706c6f61642d696d616765732e6a69616e7368752e696f2f75706c6f61645f696d616765732f383535323230312d633338336665383461656565313362632e706e673f696d6167654d6f6772322f6175746f2d6f7269656e742f7374726970253743696d61676556696577322f322f772f31323430
> > > >
> > >
> > > just like Hive's bucket map join, the picture show how it work. if
> there
> > a
> > > SQL A join B, and the join expr hit the distribute info of A. Bucket
> > > shuffle join only need distribute table B, sent the data to proper
> table
> > A
> > > part. So the network cost is always B.
> > >
> > > So compared with the original join, obviously bucket shuffle join lead
> to
> > > less network cost：
> > >
> > >                  B < min(3B, A + B)
> > >
> > >
> > > It can bring us the following benefits:
> > >
> > >   1.  First, Bucket Shuffle Join reduce the network cost and lead to a
> > > better performance for some join. Especially when the bucket is
> cropped.
> > >
> > >   2.  It does not strongly rely on the mechanism of collocate, so it is
> > > transparent to users. There is no mandatory requirement for data
> > > distribution, which will not lead to data skew.
> > >
> > >   3.  It can provide more query optimization space for join reorder.
> > >
> > > POC of Bucket Shuffle Join
> > >
> > > Now I've implemented a simple Bucket Shuffle join in Doris and test the
> > > performance of it.
> > >
> > > Now, we chose tpcds query 57. The query have 6 join operation, and 4 of
> > > them can hit Bucket shuffle join.
> > >
> > > Origin Doris    Bucket shuffle join
> > > Time Cost       27.7s   16.4s
> > >
> > > It seems to work as well as we expected. I'll do more experiments to
> > > verify its performance in the future
> > >
> > > Implementation
> > >
> > >   1.  First, we should add a partition type in thrift type
> > >
> > >   2.  FE able to plan and sense queries that can be used bucket shuffle
> > > join. send data distribution info to BE
> > >
> > >   3.  BE use the proper hash function to send proper data to proper
> > > instance of BE.
> > >
> > >
> > >       Best Wish
> > >     Happen Lee
> > >
> > >                                                            
> > >
> > >
> >
>

Re: [Proposal] Support Bucket Shuffle Join for Doris

Posted by Lee Happen <Ha...@hotmail.com>.

Hi Zhao Chun

1. It can only be applied to a single partition. It is similar to that when the table joins itself, it should use collocate join in the single partition scenario. The query like TPCDS is not multi partitioned, so the effect will be better.
Doris don't just have partition cutting, but also bucket cutting. With bucket clipping in effect, bucket join can reduce the amount of data sent by the network.
For example, table A has 10 buckets and only 2 buckets after trimming. Then table B only needs to send 1 / 5 of the data. Compared with shuffle join and broadcast join, which need to send much less data.

2. The amount of distributed partition data will not be very complex. The implementation here is consistent with the original. After the list instance is sent, it will know which line to send through the list instance. If the instance is tagged to empty, the data will not be sent which means the bucket is clipped. The size of this list is consistent with the num bucket of partition. The only thing to note is to replace the hash algorithm in BE with CRC32

3. I will post POC profile when I can reuse the test env.


Best Wish,
Happen Lee
________________________________
From: Zhao Chun <zh...@apache.org>
Sent: August 19, 2020 7:47
To: dev@doris.apache.org <de...@doris.apache.org>
Subject: Re: [Proposal] Support Bucket Shuffle Join for Doris

Hi Happen

I think we can only support bucket join of single partition table. But I
have some concerns

1. In real use, most scenes are created with multiple partitions. The
possible scenarios for this optimization are limited.

2. It will make the data stream sender more complex. Because it is
necessary to transfer the whole partition distribution information from Fe,
and the hash mode of data import should be coupled when data hash
calculation.

In addition, can you post the POC profile content? I want to know the
specific reasons for the promotion.

Thanks,
Zhao Chun


Lee Happen <Ha...@hotmail.com> 于2020年8月19日周三 下午2:39写道：

>
> Hi Zhao Chun
>
> Let me explain the two questions you mentioned
>
>
> 1. Doris have data partition and data distribution meta,  a certain bucket
> seq and partition can be located to a tablet. Each tablet may be
> distributed on different machines. if we both consider data partition and
> data distribution. The Network consumption of bucket shuffle join will
> degenerate. More machine more degradation.
>
> So Bucket shuffle join only when a single partition is queried can it play
> its role
>
> 2. Before the query, the partition clipping will be carried out by FE. In
> many query scenarios, queries are limited to a single partition or even a
> few buckets in a single partition. In other words, the better the effect of
> partition cutting, The better the performance of Bucket shuffle join.
>
> So we only need take care of distribution column, but not the data
> partition column.
>
> 3. if we support bucket join for tables with group info. It's not much
> different from colocate join. Most scenarios colocate join do well thing
> than Bucket shuffle join. It also brings the problem of data skew.
>
> I think Bucket Shuffle Join Is a special scene optimization, it is
> transparent to users. If there are multi partition in left table, we still
> can use
> shuffle join, broadcast join like before.
>
>
> Best Wish,
> Happen Lee
> ________________________________
> From: Zhao Chun <zh...@apache.org>
> Sent: August 19, 2020 5:53
> To: dev@doris.apache.org <de...@doris.apache.org>
> Subject: Re: [Proposal] Support Bucket Shuffle Join for Doris
>
> Hi Happen
>
> Good proposal.
>
> If we want to carry out this work, we need to pay attention to several
> aspects.
>
> 1. The data partition and data distribution of Doris are two independent
> modules. If you want to transfer data according to the data distribution,
> the metadata information may be high, including all partition information
> and bucket information.
>
> 2. Currently, Doris supports partition first and then distributed. Usually,
> the two are based on different columns. This will cause the same
> distribution columns to be on the same machine. For example, if table a is
> partitioned according to the time column, and then the bucket is divided
> according to the userid row, then the same userid in different partitions
> will appear on different machines. In extreme cases, the same userid may
> appear on all machines, which may degenerate into broadcast join when
> joining the userid column.
>
> I think we should be able to support bucket join for tables with group
> info. If we support all tables, there are still many points to consider.
>
> Thanks,
> Zhao Chun
>
>
> Lee Happen <Ha...@hotmail.com> 于2020年8月19日周三 下午1:08写道：
>
> >
> > Motivation
> >
> > At present, Doris support 3 type join: shuffle join, broadcast join,
> > colocate join.
> > Except colocate join，another join will lead to a lot of network
> > consumption.
> >
> > For example, there a SQL A join B, the cost of network.
> >
> >   *   broadcast join: if table A is divided into three parts，the net work
> > cost is 3B
> >   *   shuffle join: the network cost is A + B.
> >
> > These network consumption not only leads to slow query, but also leads to
> > extra memory consumption during join.
> >
> > Each Doris table have disrtribute info, if the join expr hit the
> > distribute info, we should use the distribute info to reduce the network
> > consumption.
> >
> > What is bucket shuffle join
> >
> > [image.png]<
> >
> https://camo.githubusercontent.com/ba3ac7db1e7c983ec0f555d332b1064c69a9dc2a/68747470733a2f2f75706c6f61642d696d616765732e6a69616e7368752e696f2f75706c6f61645f696d616765732f383535323230312d633338336665383461656565313362632e706e673f696d6167654d6f6772322f6175746f2d6f7269656e742f7374726970253743696d61676556696577322f322f772f31323430
> > >
> >
> > just like Hive's bucket map join, the picture show how it work. if there
> a
> > SQL A join B, and the join expr hit the distribute info of A. Bucket
> > shuffle join only need distribute table B, sent the data to proper table
> A
> > part. So the network cost is always B.
> >
> > So compared with the original join, obviously bucket shuffle join lead to
> > less network cost：
> >
> >                  B < min(3B, A + B)
> >
> >
> > It can bring us the following benefits:
> >
> >   1.  First, Bucket Shuffle Join reduce the network cost and lead to a
> > better performance for some join. Especially when the bucket is cropped.
> >
> >   2.  It does not strongly rely on the mechanism of collocate, so it is
> > transparent to users. There is no mandatory requirement for data
> > distribution, which will not lead to data skew.
> >
> >   3.  It can provide more query optimization space for join reorder.
> >
> > POC of Bucket Shuffle Join
> >
> > Now I've implemented a simple Bucket Shuffle join in Doris and test the
> > performance of it.
> >
> > Now, we chose tpcds query 57. The query have 6 join operation, and 4 of
> > them can hit Bucket shuffle join.
> >
> > Origin Doris    Bucket shuffle join
> > Time Cost       27.7s   16.4s
> >
> > It seems to work as well as we expected. I'll do more experiments to
> > verify its performance in the future
> >
> > Implementation
> >
> >   1.  First, we should add a partition type in thrift type
> >
> >   2.  FE able to plan and sense queries that can be used bucket shuffle
> > join. send data distribution info to BE
> >
> >   3.  BE use the proper hash function to send proper data to proper
> > instance of BE.
> >
> >
> >       Best Wish
> >     Happen Lee
> >
> >                                                            
> >
> >
>

Re: [Proposal] Support Bucket Shuffle Join for Doris

Posted by Zhao Chun <zh...@apache.org>.

Hi Happen

I think we can only support bucket join of single partition table. But I
have some concerns

1. In real use, most scenes are created with multiple partitions. The
possible scenarios for this optimization are limited.

2. It will make the data stream sender more complex. Because it is
necessary to transfer the whole partition distribution information from Fe,
and the hash mode of data import should be coupled when data hash
calculation.

In addition, can you post the POC profile content? I want to know the
specific reasons for the promotion.

Thanks,
Zhao Chun


Lee Happen <Ha...@hotmail.com> 于2020年8月19日周三 下午2:39写道：

>
> Hi Zhao Chun
>
> Let me explain the two questions you mentioned
>
>
> 1. Doris have data partition and data distribution meta,  a certain bucket
> seq and partition can be located to a tablet. Each tablet may be
> distributed on different machines. if we both consider data partition and
> data distribution. The Network consumption of bucket shuffle join will
> degenerate. More machine more degradation.
>
> So Bucket shuffle join only when a single partition is queried can it play
> its role
>
> 2. Before the query, the partition clipping will be carried out by FE. In
> many query scenarios, queries are limited to a single partition or even a
> few buckets in a single partition. In other words, the better the effect of
> partition cutting, The better the performance of Bucket shuffle join.
>
> So we only need take care of distribution column, but not the data
> partition column.
>
> 3. if we support bucket join for tables with group info. It's not much
> different from colocate join. Most scenarios colocate join do well thing
> than Bucket shuffle join. It also brings the problem of data skew.
>
> I think Bucket Shuffle Join Is a special scene optimization, it is
> transparent to users. If there are multi partition in left table, we still
> can use
> shuffle join, broadcast join like before.
>
>
> Best Wish,
> Happen Lee
> ________________________________
> From: Zhao Chun <zh...@apache.org>
> Sent: August 19, 2020 5:53
> To: dev@doris.apache.org <de...@doris.apache.org>
> Subject: Re: [Proposal] Support Bucket Shuffle Join for Doris
>
> Hi Happen
>
> Good proposal.
>
> If we want to carry out this work, we need to pay attention to several
> aspects.
>
> 1. The data partition and data distribution of Doris are two independent
> modules. If you want to transfer data according to the data distribution,
> the metadata information may be high, including all partition information
> and bucket information.
>
> 2. Currently, Doris supports partition first and then distributed. Usually,
> the two are based on different columns. This will cause the same
> distribution columns to be on the same machine. For example, if table a is
> partitioned according to the time column, and then the bucket is divided
> according to the userid row, then the same userid in different partitions
> will appear on different machines. In extreme cases, the same userid may
> appear on all machines, which may degenerate into broadcast join when
> joining the userid column.
>
> I think we should be able to support bucket join for tables with group
> info. If we support all tables, there are still many points to consider.
>
> Thanks,
> Zhao Chun
>
>
> Lee Happen <Ha...@hotmail.com> 于2020年8月19日周三 下午1:08写道：
>
> >
> > Motivation
> >
> > At present, Doris support 3 type join: shuffle join, broadcast join,
> > colocate join.
> > Except colocate join，another join will lead to a lot of network
> > consumption.
> >
> > For example, there a SQL A join B, the cost of network.
> >
> >   *   broadcast join: if table A is divided into three parts，the net work
> > cost is 3B
> >   *   shuffle join: the network cost is A + B.
> >
> > These network consumption not only leads to slow query, but also leads to
> > extra memory consumption during join.
> >
> > Each Doris table have disrtribute info, if the join expr hit the
> > distribute info, we should use the distribute info to reduce the network
> > consumption.
> >
> > What is bucket shuffle join
> >
> > [image.png]<
> >
> https://camo.githubusercontent.com/ba3ac7db1e7c983ec0f555d332b1064c69a9dc2a/68747470733a2f2f75706c6f61642d696d616765732e6a69616e7368752e696f2f75706c6f61645f696d616765732f383535323230312d633338336665383461656565313362632e706e673f696d6167654d6f6772322f6175746f2d6f7269656e742f7374726970253743696d61676556696577322f322f772f31323430
> > >
> >
> > just like Hive's bucket map join, the picture show how it work. if there
> a
> > SQL A join B, and the join expr hit the distribute info of A. Bucket
> > shuffle join only need distribute table B, sent the data to proper table
> A
> > part. So the network cost is always B.
> >
> > So compared with the original join, obviously bucket shuffle join lead to
> > less network cost：
> >
> >                  B < min(3B, A + B)
> >
> >
> > It can bring us the following benefits:
> >
> >   1.  First, Bucket Shuffle Join reduce the network cost and lead to a
> > better performance for some join. Especially when the bucket is cropped.
> >
> >   2.  It does not strongly rely on the mechanism of collocate, so it is
> > transparent to users. There is no mandatory requirement for data
> > distribution, which will not lead to data skew.
> >
> >   3.  It can provide more query optimization space for join reorder.
> >
> > POC of Bucket Shuffle Join
> >
> > Now I've implemented a simple Bucket Shuffle join in Doris and test the
> > performance of it.
> >
> > Now, we chose tpcds query 57. The query have 6 join operation, and 4 of
> > them can hit Bucket shuffle join.
> >
> > Origin Doris    Bucket shuffle join
> > Time Cost       27.7s   16.4s
> >
> > It seems to work as well as we expected. I'll do more experiments to
> > verify its performance in the future
> >
> > Implementation
> >
> >   1.  First, we should add a partition type in thrift type
> >
> >   2.  FE able to plan and sense queries that can be used bucket shuffle
> > join. send data distribution info to BE
> >
> >   3.  BE use the proper hash function to send proper data to proper
> > instance of BE.
> >
> >
> >       Best Wish
> >     Happen Lee
> >
> >                                                            
> >
> >
>

Re: [Proposal] Support Bucket Shuffle Join for Doris

Posted by Lee Happen <Ha...@hotmail.com>.

Hi Zhao Chun

Let me explain the two questions you mentioned


1. Doris have data partition and data distribution meta,  a certain bucket seq and partition can be located to a tablet. Each tablet may be distributed on different machines. if we both consider data partition and data distribution. The Network consumption of bucket shuffle join will degenerate. More machine more degradation.

So Bucket shuffle join only when a single partition is queried can it play its role

2. Before the query, the partition clipping will be carried out by FE. In many query scenarios, queries are limited to a single partition or even a few buckets in a single partition. In other words, the better the effect of partition cutting, The better the performance of Bucket shuffle join.

So we only need take care of distribution column, but not the data partition column.

3. if we support bucket join for tables with group info. It's not much different from colocate join. Most scenarios colocate join do well thing than Bucket shuffle join. It also brings the problem of data skew.

I think Bucket Shuffle Join Is a special scene optimization, it is transparent to users. If there are multi partition in left table, we still can use
shuffle join, broadcast join like before.


Best Wish,
Happen Lee
________________________________
From: Zhao Chun <zh...@apache.org>
Sent: August 19, 2020 5:53
To: dev@doris.apache.org <de...@doris.apache.org>
Subject: Re: [Proposal] Support Bucket Shuffle Join for Doris

Hi Happen

Good proposal.

If we want to carry out this work, we need to pay attention to several
aspects.

1. The data partition and data distribution of Doris are two independent
modules. If you want to transfer data according to the data distribution,
the metadata information may be high, including all partition information
and bucket information.

2. Currently, Doris supports partition first and then distributed. Usually,
the two are based on different columns. This will cause the same
distribution columns to be on the same machine. For example, if table a is
partitioned according to the time column, and then the bucket is divided
according to the userid row, then the same userid in different partitions
will appear on different machines. In extreme cases, the same userid may
appear on all machines, which may degenerate into broadcast join when
joining the userid column.

I think we should be able to support bucket join for tables with group
info. If we support all tables, there are still many points to consider.

Thanks,
Zhao Chun


Lee Happen <Ha...@hotmail.com> 于2020年8月19日周三 下午1:08写道：

>
> Motivation
>
> At present, Doris support 3 type join: shuffle join, broadcast join,
> colocate join.
> Except colocate join，another join will lead to a lot of network
> consumption.
>
> For example, there a SQL A join B, the cost of network.
>
>   *   broadcast join: if table A is divided into three parts，the net work
> cost is 3B
>   *   shuffle join: the network cost is A + B.
>
> These network consumption not only leads to slow query, but also leads to
> extra memory consumption during join.
>
> Each Doris table have disrtribute info, if the join expr hit the
> distribute info, we should use the distribute info to reduce the network
> consumption.
>
> What is bucket shuffle join
>
> [image.png]<
> https://camo.githubusercontent.com/ba3ac7db1e7c983ec0f555d332b1064c69a9dc2a/68747470733a2f2f75706c6f61642d696d616765732e6a69616e7368752e696f2f75706c6f61645f696d616765732f383535323230312d633338336665383461656565313362632e706e673f696d6167654d6f6772322f6175746f2d6f7269656e742f7374726970253743696d61676556696577322f322f772f31323430
> >
>
> just like Hive's bucket map join, the picture show how it work. if there a
> SQL A join B, and the join expr hit the distribute info of A. Bucket
> shuffle join only need distribute table B, sent the data to proper table A
> part. So the network cost is always B.
>
> So compared with the original join, obviously bucket shuffle join lead to
> less network cost：
>
>                  B < min(3B, A + B)
>
>
> It can bring us the following benefits:
>
>   1.  First, Bucket Shuffle Join reduce the network cost and lead to a
> better performance for some join. Especially when the bucket is cropped.
>
>   2.  It does not strongly rely on the mechanism of collocate, so it is
> transparent to users. There is no mandatory requirement for data
> distribution, which will not lead to data skew.
>
>   3.  It can provide more query optimization space for join reorder.
>
> POC of Bucket Shuffle Join
>
> Now I've implemented a simple Bucket Shuffle join in Doris and test the
> performance of it.
>
> Now, we chose tpcds query 57. The query have 6 join operation, and 4 of
> them can hit Bucket shuffle join.
>
> Origin Doris    Bucket shuffle join
> Time Cost       27.7s   16.4s
>
> It seems to work as well as we expected. I'll do more experiments to
> verify its performance in the future
>
> Implementation
>
>   1.  First, we should add a partition type in thrift type
>
>   2.  FE able to plan and sense queries that can be used bucket shuffle
> join. send data distribution info to BE
>
>   3.  BE use the proper hash function to send proper data to proper
> instance of BE.
>
>
>       Best Wish
>     Happen Lee
>
>                                                            
>
>

Re: [Proposal] Support Bucket Shuffle Join for Doris

Posted by Zhao Chun <zh...@apache.org>.

Hi Happen

Good proposal.

If we want to carry out this work, we need to pay attention to several
aspects.

1. The data partition and data distribution of Doris are two independent
modules. If you want to transfer data according to the data distribution,
the metadata information may be high, including all partition information
and bucket information.

2. Currently, Doris supports partition first and then distributed. Usually,
the two are based on different columns. This will cause the same
distribution columns to be on the same machine. For example, if table a is
partitioned according to the time column, and then the bucket is divided
according to the userid row, then the same userid in different partitions
will appear on different machines. In extreme cases, the same userid may
appear on all machines, which may degenerate into broadcast join when
joining the userid column.

I think we should be able to support bucket join for tables with group
info. If we support all tables, there are still many points to consider.

Thanks,
Zhao Chun


Lee Happen <Ha...@hotmail.com> 于2020年8月19日周三 下午1:08写道：

>
> Motivation
>
> At present, Doris support 3 type join: shuffle join, broadcast join,
> colocate join.
> Except colocate join，another join will lead to a lot of network
> consumption.
>
> For example, there a SQL A join B, the cost of network.
>
>   *   broadcast join: if table A is divided into three parts，the net work
> cost is 3B
>   *   shuffle join: the network cost is A + B.
>
> These network consumption not only leads to slow query, but also leads to
> extra memory consumption during join.
>
> Each Doris table have disrtribute info, if the join expr hit the
> distribute info, we should use the distribute info to reduce the network
> consumption.
>
> What is bucket shuffle join
>
> [image.png]<
> https://camo.githubusercontent.com/ba3ac7db1e7c983ec0f555d332b1064c69a9dc2a/68747470733a2f2f75706c6f61642d696d616765732e6a69616e7368752e696f2f75706c6f61645f696d616765732f383535323230312d633338336665383461656565313362632e706e673f696d6167654d6f6772322f6175746f2d6f7269656e742f7374726970253743696d61676556696577322f322f772f31323430
> >
>
> just like Hive's bucket map join, the picture show how it work. if there a
> SQL A join B, and the join expr hit the distribute info of A. Bucket
> shuffle join only need distribute table B, sent the data to proper table A
> part. So the network cost is always B.
>
> So compared with the original join, obviously bucket shuffle join lead to
> less network cost：
>
>                  B < min(3B, A + B)
>
>
> It can bring us the following benefits:
>
>   1.  First, Bucket Shuffle Join reduce the network cost and lead to a
> better performance for some join. Especially when the bucket is cropped.
>
>   2.  It does not strongly rely on the mechanism of collocate, so it is
> transparent to users. There is no mandatory requirement for data
> distribution, which will not lead to data skew.
>
>   3.  It can provide more query optimization space for join reorder.
>
> POC of Bucket Shuffle Join
>
> Now I've implemented a simple Bucket Shuffle join in Doris and test the
> performance of it.
>
> Now, we chose tpcds query 57. The query have 6 join operation, and 4 of
> them can hit Bucket shuffle join.
>
> Origin Doris    Bucket shuffle join
> Time Cost       27.7s   16.4s
>
> It seems to work as well as we expected. I'll do more experiments to
> verify its performance in the future
>
> Implementation
>
>   1.  First, we should add a partition type in thrift type
>
>   2.  FE able to plan and sense queries that can be used bucket shuffle
> join. send data distribution info to BE
>
>   3.  BE use the proper hash function to send proper data to proper
> instance of BE.
>
>
>       Best Wish
>     Happen Lee
>
>                                                            
>
>