You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@drill.apache.org by Mr...@rbs.com.INVALID on 2016/11/04 08:53:07 UTC

Drill Join query optimization

Hello,

I am working on a proof of concept to optimize the performance of the join queries executed through drill. The underlying storage is a NO-SQL based database - Mongo DB. The time it takes to return the result of the join query is very high (46 seconds). Upon further analysis, as per the physical plan of the query it is observed that both the left side (15 lakh records) and right side table (13 lakh) are fully scanned and it takes 24 seconds and 20 seconds respectively for the same.

Here is the query -

select ta.[SOME_COLUMN] from mongo.Test.TABLEA ta INNER JOIN mongo.Test.TABLEB ta ON ta.Id = tb.Id and ta.Id ='123'

Records in table A : 15 Lakhs

Records in table B : 13 Lakhs

Filter condition : Id is indexed field in both the tables (ascending)

Drill Plan shows Hash-join being performed as shown in the plan (as attached above)

I have the following questions -

1)     Why does drill fetch all the records into the memory even though filter condition on the indexed column was provided for one of the table ?

2)     If the Drill planner/optimizer is intelligent then records could have been filtered on the second table as well based on the join condition (to reduce the dataset and hence result in faster execution time)

3)     Are there any performance tuning options available to bring down the query response time and generate a better query plan ?



I would appreciate if you can provide me pointers/directions or answers to the above questions.

Please feel free if you need any additional information.

Thanks,

Mridul





*********************************************************************************** 
The Royal Bank of Scotland plc. Registered in Scotland No 90312. 
Registered Office: 36 St Andrew Square, Edinburgh EH2 2YB. 
Authorised by the Prudential Regulation Authority and regulated 
by the Financial Conduct Authority and Prudential Regulation Authority. 
The Royal Bank of Scotland N.V. is authorised and regulated by the 
De Nederlandsche Bank and has its seat at Amsterdam, the 
Netherlands, and is registered in the Commercial Register under 
number 33002587. Registered Office: Gustav Mahlerlaan 350, 
Amsterdam, The Netherlands. The Royal Bank of Scotland N.V. and 
The Royal Bank of Scotland plc are authorised to act as agent for each 
other in certain jurisdictions. 
  
This e-mail message is confidential and for use by the addressee only. 
If the message is received by anyone other than the addressee, please 
return the message to the sender by replying to it and then delete the 
message from your computer. Internet e-mails are not necessarily 
secure. The Royal Bank of Scotland plc and The Royal Bank of Scotland 
N.V. including its affiliates ("RBS group") does not accept responsibility 
for changes made to this message after it was sent. For the protection
of RBS group and its clients and customers, and in compliance with
regulatory requirements, the contents of both incoming and outgoing
e-mail communications, which could include proprietary information and
Non-Public Personal Information, may be read by authorised persons
within RBS group other than the intended recipient(s). 

Whilst all reasonable care has been taken to avoid the transmission of 
viruses, it is the responsibility of the recipient to ensure that the onward 
transmission, opening or use of this message and any attachments will 
not adversely affect its systems or data. No responsibility is accepted 
by the RBS group in this regard and the recipient should carry out such 
virus and other checks as it considers appropriate. 

Visit our website at www.rbs.com 
***********************************************************************************  

Re: Drill Join query optimization

Posted by Jinfeng Ni <jn...@apache.org>.
I'm not familiar with Mongo storage plugin. But looking at the code,
seems there is a MongoPushDownFilterForScan rule, which should be able
to push the filter ta.Id ='123' into TableA. If your query plan does
not show this filter is pushed down into Mongo Scan operator, there
might be a bug in this rule.

For the filter pushdown on TableB side, it's related to filter created
from transitive closure. Looks like Drill does not enable such rule.
As such, we do not create  tb.Id='123', from ta.Id = tb.Id and
ta.id='123'. Such rule should be independent from Mongo storage
plugin, and should work for any storage plugin.




[1] https://github.com/apache/drill/blob/master/contrib/storage-mongo/src/main/java/org/apache/drill/exec/store/mongo/MongoPushDownFilterForScan.java

On Fri, Nov 4, 2016 at 4:28 AM, Khurram Faraaz <kf...@maprtech.com> wrote:
> What version of Drill are you on ?
> Can you please share the query plan for your query ?
> Please share the table definitions ?
> Is this a standalone setup or cluster of Drillbits ?
>
>
> On Fri, Nov 4, 2016 at 2:23 PM, <Mr...@rbs.com.invalid> wrote:
>
>> Hello,
>>
>> I am working on a proof of concept to optimize the performance of the
>> join queries executed through drill. The underlying storage is a NO-SQL
>> based database - Mongo DB. The time it takes to return the result of the
>> join query is very high (46 seconds). Upon further analysis, as per the
>> physical plan of the query it is observed that both the left side (15 lakh
>> records) and right side table (13 lakh) are fully scanned and it takes 24
>> seconds and 20 seconds respectively for the same.
>>
>> Here is the query -
>>
>> select ta.[SOME_COLUMN] from mongo.Test.TABLEA ta INNER JOIN
>> mongo.Test.TABLEB ta ON ta.Id = tb.Id and ta.Id ='123'
>>
>> Records in table A : 15 Lakhs
>>
>> Records in table B : 13 Lakhs
>>
>> Filter condition : Id is indexed field in both the tables (ascending)
>>
>> Drill Plan shows Hash-join being performed as shown in the plan (as
>> attached above)
>>
>> I have the following questions -
>>
>> 1)     Why does drill fetch all the records into the memory even though
>> filter condition on the indexed column was provided for one of the table ?
>>
>> 2)     If the Drill planner/optimizer is intelligent then records could
>> have been filtered on the second table as well based on the join condition
>> (to reduce the dataset and hence result in faster execution time)
>>
>> 3)     Are there any performance tuning options available to bring down
>> the query response time and generate a better query plan ?
>>
>>
>>
>> I would appreciate if you can provide me pointers/directions or answers to
>> the above questions.
>>
>> Please feel free if you need any additional information.
>>
>> Thanks,
>>
>> Mridul
>>
>>
>>
>>
>>
>>
>>
>>
>> ***********************************************************************************
>>
>> The Royal Bank of Scotland plc. Registered in Scotland No 90312.
>> Registered Office: 36 St Andrew Square, Edinburgh EH2 2YB.
>> Authorised by the Prudential Regulation Authority and regulated
>> by the Financial Conduct Authority and Prudential Regulation Authority.
>> The Royal Bank of Scotland N.V. is authorised and regulated by the
>> De Nederlandsche Bank and has its seat at Amsterdam, the
>> Netherlands, and is registered in the Commercial Register under
>> number 33002587. Registered Office: Gustav Mahlerlaan 350,
>> Amsterdam, The Netherlands. The Royal Bank of Scotland N.V. and
>> The Royal Bank of Scotland plc are authorised to act as agent for each
>> other in certain jurisdictions.
>>
>> This e-mail message is confidential and for use by the addressee only.
>> If the message is received by anyone other than the addressee, please
>> return the message to the sender by replying to it and then delete the
>> message from your computer. Internet e-mails are not necessarily
>> secure. The Royal Bank of Scotland plc and The Royal Bank of Scotland
>> N.V. including its affiliates ("RBS group") does not accept responsibility
>> for changes made to this message after it was sent. For the protection
>> of RBS group and its clients and customers, and in compliance with
>> regulatory requirements, the contents of both incoming and outgoing
>> e-mail communications, which could include proprietary information and
>> Non-Public Personal Information, may be read by authorised persons
>> within RBS group other than the intended recipient(s).
>>
>> Whilst all reasonable care has been taken to avoid the transmission of
>> viruses, it is the responsibility of the recipient to ensure that the
>> onward
>> transmission, opening or use of this message and any attachments will
>> not adversely affect its systems or data. No responsibility is accepted
>> by the RBS group in this regard and the recipient should carry out such
>> virus and other checks as it considers appropriate.
>>
>> Visit our website at www.rbs.com
>> ***********************************************************************************
>>
>>

Re: Drill Join query optimization

Posted by Khurram Faraaz <kf...@maprtech.com>.
What version of Drill are you on ?
Can you please share the query plan for your query ?
Please share the table definitions ?
Is this a standalone setup or cluster of Drillbits ?


On Fri, Nov 4, 2016 at 2:23 PM, <Mr...@rbs.com.invalid> wrote:

> Hello,
>
> I am working on a proof of concept to optimize the performance of the
> join queries executed through drill. The underlying storage is a NO-SQL
> based database - Mongo DB. The time it takes to return the result of the
> join query is very high (46 seconds). Upon further analysis, as per the
> physical plan of the query it is observed that both the left side (15 lakh
> records) and right side table (13 lakh) are fully scanned and it takes 24
> seconds and 20 seconds respectively for the same.
>
> Here is the query -
>
> select ta.[SOME_COLUMN] from mongo.Test.TABLEA ta INNER JOIN
> mongo.Test.TABLEB ta ON ta.Id = tb.Id and ta.Id ='123'
>
> Records in table A : 15 Lakhs
>
> Records in table B : 13 Lakhs
>
> Filter condition : Id is indexed field in both the tables (ascending)
>
> Drill Plan shows Hash-join being performed as shown in the plan (as
> attached above)
>
> I have the following questions -
>
> 1)     Why does drill fetch all the records into the memory even though
> filter condition on the indexed column was provided for one of the table ?
>
> 2)     If the Drill planner/optimizer is intelligent then records could
> have been filtered on the second table as well based on the join condition
> (to reduce the dataset and hence result in faster execution time)
>
> 3)     Are there any performance tuning options available to bring down
> the query response time and generate a better query plan ?
>
>
>
> I would appreciate if you can provide me pointers/directions or answers to
> the above questions.
>
> Please feel free if you need any additional information.
>
> Thanks,
>
> Mridul
>
>
>
>
>
>
>
>
> ***********************************************************************************
>
> The Royal Bank of Scotland plc. Registered in Scotland No 90312.
> Registered Office: 36 St Andrew Square, Edinburgh EH2 2YB.
> Authorised by the Prudential Regulation Authority and regulated
> by the Financial Conduct Authority and Prudential Regulation Authority.
> The Royal Bank of Scotland N.V. is authorised and regulated by the
> De Nederlandsche Bank and has its seat at Amsterdam, the
> Netherlands, and is registered in the Commercial Register under
> number 33002587. Registered Office: Gustav Mahlerlaan 350,
> Amsterdam, The Netherlands. The Royal Bank of Scotland N.V. and
> The Royal Bank of Scotland plc are authorised to act as agent for each
> other in certain jurisdictions.
>
> This e-mail message is confidential and for use by the addressee only.
> If the message is received by anyone other than the addressee, please
> return the message to the sender by replying to it and then delete the
> message from your computer. Internet e-mails are not necessarily
> secure. The Royal Bank of Scotland plc and The Royal Bank of Scotland
> N.V. including its affiliates ("RBS group") does not accept responsibility
> for changes made to this message after it was sent. For the protection
> of RBS group and its clients and customers, and in compliance with
> regulatory requirements, the contents of both incoming and outgoing
> e-mail communications, which could include proprietary information and
> Non-Public Personal Information, may be read by authorised persons
> within RBS group other than the intended recipient(s).
>
> Whilst all reasonable care has been taken to avoid the transmission of
> viruses, it is the responsibility of the recipient to ensure that the
> onward
> transmission, opening or use of this message and any attachments will
> not adversely affect its systems or data. No responsibility is accepted
> by the RBS group in this regard and the recipient should carry out such
> virus and other checks as it considers appropriate.
>
> Visit our website at www.rbs.com
> ***********************************************************************************
>
>