You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@drill.apache.org by sreeparna bhabani <bh...@gmail.com> on 2020/04/22 18:31:05 UTC

Suggestion needed for UNION ALL performance in Apache drill

Hi Team,

I reach out to you for a specific problem regarding UNION ALL. There is one
UNION ALL statement which combines 2 queries. The individual queries are
taking 8 secs and 1 sec respectively. But UNION ALL takes 30 secs.
PARQUET_SCAN_ROW_GROUP takes the maximum time. Apache drill version is 1.17.

Please help to suggest how to improve this UNION ALL performance. We are
using parquet file.

Thanks,
Sreeparna Bhabani

Re: Suggestion needed for UNION ALL performance in Apache drill

Posted by sreeparna bhabani <bh...@gmail.com>.
Hi Paul and Team,

I reach out to you again for suggestion on the UNION ALL performance
between Parquet and database.

Observation : the query is not parallelised in multiple drillbits when we
select from database. But the same number of rows are parallelised when we
select from Parquet. This brings down the performance of UNION ALL between
database and Parquet.

I have shared few details in trailing mail. Please let me know if you need
any other specific information.

Appreciate your help on this.

Thanks in advance,
Sreeparna Bhabani




On Wed, 6 May 2020, 16:42 sreeparna bhabani, <bh...@gmail.com>
wrote:

> Hi Paul,
>
> Thanks for the analysis you did. The issue occurs when we UNION ALL
> between Parquet and database. Here find below more details on the scenario-
>
> The size of the folder-
> 849 MB
> The total number of parquet files under this folder - 19
> Total number of rows in this folder - 25245 K
> Selected rows from Parquet after filter- 169 K
> Selected rows from database - 50
>
> I have tested the select query from database with same number of rows
> (16402 K). But it is not parallelized for database query. In fact I tried
> DB UNION ALL DB also. But it is not parallelized ( each of the DB query has
> 109 K rows).
>
> I tried "Parquet JOIN database" for testing purpose to check whether it is
> parallelized during joining. Although the JDBC_SCAN was not parallelized,
> PARQUET_ROW_GROUP_SCAN parallelized here.
>
> FYI - The planner.slice_target = 100000
>
> I have raised one JIRA ticket. Please find the same -
> <https://issues.apache.org/jira/browse/DRILL-7720>
> https://issues.apache.org/jira/browse/DRILL-7720
>
> The query we used-
> SELECT column1, column2,...column25
> FROM    dfs.`root`.`Parquet`
> WHERE column1 = <> AND column2 = <> AND column3 = <>
> UNION ALL
> SELECT  column1, column2,...column25
> FROM    db.`Database`
> WHERE column1 = <> AND column2 = <> AND column3 = <>
>
> Below is the Plan after redacted names-
> 00-00 Screen
> 00-01 Project(Column1=[$0],...,Column25=[$24])
> 00-02 UnionExchange
> 01-01 UnionAll(all=[true])
> 01-03 Project (Column1=[$0],...,Column25=[$24])
> 01-05 SelectionVectorRemover
> 01-06
> Filter(condition=[AND(=$0,'<value>',),=($1,'<value>',=($2,'<value>'))])
> 01-07 Scan(table=[[]], groupscan=[ParquetGroupScan], selectionroot=<>,
> numFiles=13, numRowGroups=69, usedMetadatafile=true, usedMetastore=false,
> filter=booleanAnd(..), cacheFileRoot=.., columns=[.....] )
> 01-02  Project(Column1=[$0],...,Column25=[$24])
> 01-04 Jdbc()
>
> json-
> {
> "head" : {
> "version":1,
> "generator" : {"type":"ExplainHandler","info":""},
> "type" : "APACHE_DRILL_PHYSICAL","options" :
> [],"queue":0,"hasResourcePlan":false,"resultMode":"EXEC"
> },
> "graph":[
> {"pop":"jdbc-scan"},
> {"pop":"project","initialAllocation""
> 1000000,"maxAllocation":10000000000,"cost":{memoryCost}:4194304,"outputRowCount":1},
> {"pop":"parquet-scan",
> "cost":{"memoryCost":4194304,"outputRowCount":1.2287038E7} },
> {"pop":"filter" ,"initialAllocation""
> 1000000,"maxAllocation":10000000000,"cost":{"memoryCost":4194304,"outputRowCount":41468.75}
> },
> {"pop":"selection-vector-remover" ,"initialAllocation""
> 1000000,"maxAllocation":10000000000,"cost":{memoryCost}:4194304,"outputRowCount":41468.75
> }  ,
> {"pop":"project" ,"initialAllocation""
> 1000000,"maxAllocation":10000000000,"cost":{"memoryCost":4194304,"outputRowCount":41468.75}
> }
> {"pop":"union-all" ,"initialAllocation""
> 1000000,"maxAllocation":10000000000,"cost":{"memoryCost":4194304,"outputRowCount":41468.75}
> }    ,
> {"pop":"union-exchange" ,"initialAllocation""
> 1000000,"maxAllocation":10000000000,"cost":{"memoryCost":4194304,"outputRowCount":41468.75
> } }    ,
> {"pop":"project" ,"initialAllocation""
> 1000000,"maxAllocation":10000000000,"cost":{"memoryCost":4194304,"outputRowCount":41468.75},"outputProj":true
> },
> {"pop":"screen" ,"initialAllocation""
> 1000000,"maxAllocation":10000000000,"cost":{"memoryCost":4194304,"outputRowCount":41468.75}
> }
> ]
> }
>
> Please let me know if any other information you need.
>
> Thanks,
> Sreeparna Bhabani
>
> On Wed, May 6, 2020 at 2:05 PM sreeparna bhabani <
> bhabani.sreeparna@gmail.com> wrote:
>
>>
>> ---------- Forwarded message ---------
>> From: Paul Rogers <pa...@yahoo.com>
>> Date: Tue, 5 May 2020, 13:21
>> Subject: Re: Suggestion needed for UNION ALL performance in Apache drill
>> To: <us...@drill.apache.org>
>> Cc: Sreeparna Bhabani <bh...@gmail.com>, <ar...@gmail.com>,
>> <na...@gmail.com>
>>
>>
>> Hi Sreeparna,
>>
>>
>> Thanks much for digging into the details. SQL is pretty complex and
>> things don't always work as we might expect.
>>
>>
>> The first question is: which plan is correct: the (Parquet UNION ALL DB)
>> plan or the (Parquet UNION ALL Parquet) plan? I tried poking around, and
>> got no definitive answer on whether UNION ALL implies ordering. On the
>> one hand, the parts of the standards that I could find didn't seem to imply
>> ordering. A UNION (no ALL) can't imply ordering since it essentially
>> does an anti-join which may be hash-partitioned. But, a StackOverflow post
>> suggested that there is an implied ordering in the case of an ORDER BY on
>> the sub-queries:
>>
>>
>> (SELECT ... ORDER BY ...)
>>
>> UNION AL
>>
>> (SELECT ... ORDER BY ...)
>>
>>
>> That is, if we can sort each sub-query, then doing so only makes sense if
>> all results of the first are returned before any results of the second.
>> (Have not checked if the above is valid in Drill. Even if it was, the
>> planner should handle the above as a special case.)
>>
>>
>> So, let's assume that the (Parquet UNION ALL Parquet) case is the correct
>> behavior. Then, we can speculate that the planner is getting confused
>> somehow in the mixed case. Each data source (file for Parquet, JDBC for the
>> DB) parallelizes independently. Each decides it needs just one fragment.
>> Somehow the planner must be saying, "well, if they both only want one
>> fragment, let's run the whole query in a single fragment."
>>
>>
>> Perhaps the decision is based on row count and the planner somehow thinks
>> the row counts will be small for one or both of the queries. In fact, what
>> happens if you do a query with (DB UNION ALL DB)? Drill's ability to
>> estimate row counts is poor, especially from JDBC. Perhaps the planner is
>> guessing the tables are small and parallelizing is unnecessary.
>>
>>
>> My advice is to file a JIRA ticket with as much detail as you can
>> provide. Certainly the information from your-email. Ideally, also the query
>> (with names redacted if necessary.) Also, the JSON query plan (obtained by
>> using EXPLAIN PLAN FOR), again with names redacted if necessary.
>>
>>
>> With that info, we can dig a bit deeper to determine why the two cases
>> come out differently.
>>
>>
>> Thanks,
>>
>> - Paul
>>
>>
>>
>> On Monday, May 4, 2020, 9:39:12 AM PDT, sreeparna bhabani <
>> bhabani.sreeparna@gmail.com> wrote:
>>
>>
>> Hi Team,
>>
>> After further checking on this UNION ALL, I found that UNION ALL
>> (between Parquet and database) behaves as expected with limited number of
>> rows and columns. But for a larger Parquet file and higher number of
>> selected rows and columns, the UNION ALL takes much higher time than sum
>> of
>> the same of individual Parquet and DB Query.
>>
>> As per the analysis, it looks like the source of this issue is-
>> Although we are using distributed mode, the UNION ALL query is executed
>> only on 1 NODE in case of Parquet UNION ALL DB. It is not distributed and
>> parallelized in multiple nodes.
>>
>> Whereas, for individual query or UNION ALL between same type datasets
>> (Parquet + Parquet) it is getting distributed in 2 NODES.
>>
>> Do you have any finding / idea on this ?
>>
>> Thanks,
>> Sreeparna Bhabani
>>
>> On Tue, Apr 28, 2020 at 9:00 PM sreeparna bhabani <
>> bhabani.sreeparna@gmail.com> wrote:
>>
>> > Hi Paul Team,
>> >
>> > Please check the observation mentioned in the  below Jira where we found
>> > that UNION ALL query is not parallelized between multiple nodes when
>> there
>> > are 2 types dataset (Parquet and Database). But it is parallelized if we
>> > query individual Parquet file.
>> >
>> > Is there any way to enforce parallel execution in multiple nodes ?
>> >
>> > Thanks,
>> > Sreeparna Bhabani
>> >
>> >
>> > On Tue, 28 Apr 2020, 20:46 sreeparna bhabani, <
>> bhabani.sreeparna@gmail.com>
>> > wrote:
>> >
>> >>
>> >> Hi Paul and Team,
>> >>
>> >> As you suggested I have created a Jira ticket which is  -
>> >> https://issues.apache.org/jira/browse/DRILL-7720.
>> >> I have mentioned details in the Jira you asked. Please have a look. As
>> >> the data is sensitive, I am trying to create dummy dataset. Will
>> >> provide once it is ready.
>> >>
>> >> Thanks,
>> >> Sreeparna Bhabani
>> >>
>> >> On Fri, Apr 24, 2020 at 11:28 AM sreeparna bhabani <
>> >> bhabani.sreeparna@gmail.com> wrote:
>> >>
>> >>>
>> >>> ---------- Forwarded message ---------
>> >>> From: Paul Rogers <pa...@yahoo.com>
>> >>> Date: Thu, 23 Apr 2020, 23:59
>> >>> Subject: Re: Suggestion needed for UNION ALL performance in Apache
>> drill
>> >>> To: <us...@drill.apache.org>, sreeparna bhabani <
>> >>> bhabani.sreeparna@gmail.com>
>> >>> Cc: <ar...@gmail.com>, <na...@gmail.com>
>> >>>
>> >>>
>> >>> Hi Sreeparna,
>> >>>
>> >>>
>> >>> As suggested in the earlier e-mail, we would not expect to see
>> different
>> >>> performance in UNION ALL than in a simple scan. Clearly you've found
>> some
>> >>> kind of issue. The next step is to investigate that issue, which is a
>> bit
>> >>> hard to do over e-mail.
>> >>>
>> >>>
>> >>> Please file a JIRA ticket to describe the issue and provide a
>> >>> reproducible test case including query and data. If your data is
>> sensitive,
>> >>> please create a dummy data set, or use the provided TPC-H data set to
>> >>> recreate the issue. We can then take a look to see what might be
>> happening.
>> >>>
>> >>>
>> >>> Thanks,
>> >>>
>> >>> - Paul
>> >>>
>> >>>
>> >>>
>> >>> On Thursday, April 23, 2020, 10:18:13 AM PDT, sreeparna bhabani <
>> >>> bhabani.sreeparna@gmail.com> wrote:
>> >>>
>> >>>
>> >>> Hi Team,
>> >>>
>> >>> In addition to the below mail I have another finding. Please consider
>> >>> below scenarios. The first 2 scenarios are giving expected results in
>> terms
>> >>> of performance. But we are not getting expected performance for 3rd
>> >>> scenario which is UNION ALL with 2 different types of datasets.
>> >>>
>> >>> *Scenario 1- Parquet UNION ALL Parquet*
>> >>> Individual execution time of 1st query - 5 secs
>> >>> Individual execution time of 2nd query - 5 secs
>> >>> UNION ALL of both queries execution time - 10 secs
>> >>>
>> >>> *Scenario 2 - DB query UNION ALL DB* *query*
>> >>> Individual execution time of 1st query - 5 secs
>> >>> Individual execution time of 2nd query - 5 secs
>> >>> UNION ALL of both queries execution time - 10 secs
>> >>>
>> >>> *Scenario 3 - Parquet UNION ALL DB query*
>> >>> Individual execution time of 1st query - 5 secs
>> >>> Individual execution time of 2nd query - 1 sec
>> >>> UNION ALL execution time - 20 secs
>> >>> Ideally the execution time should not be more than 6 secs.
>> >>>
>> >>> May I request you to check whether the UNION ALL performance of 3rd
>> >>> scenario is expected with different dataset types.
>> >>>
>> >>> Please suggest if there is any specific way to bring down the
>> execution
>> >>> time of 3rd scenario.
>> >>>
>> >>> Thanks in advance.
>> >>>
>> >>> Sreeparna Bhabani
>> >>>
>> >>>
>> >>>
>> >>> On Thu, 23 Apr 2020, 12:18 sreeparna bhabani, <
>> >>> bhabani.sreeparna@gmail.com> wrote:
>> >>>
>> >>> Hi Team,
>> >>>
>> >>> Apart from the below issue I have another question.
>> >>>
>> >>> Is there any relation between number of row groups and performance ?
>> >>>
>> >>> In the below query the number of files is 13 and numRowGroups is 69.
>> Is
>> >>> the UNION ALL takes more time if the number of rowgroup is high like
>> that.
>> >>>
>> >>> Please note that the individual Parquet query takes 6 secs. But UNION
>> >>> ALL takes 20 secs. Details are given in trail mail.
>> >>>
>> >>> Thanks,
>> >>> Sreeparna Bhabani
>> >>>
>> >>> On Thu, 23 Apr 2020, 11:08 sreeparna bhabani, <dishari.5681@gmail.com
>> >
>> >>> wrote:
>> >>>
>> >>> Hi Paul,
>> >>>
>> >>> Please find the details below. We are using 2 drillbits. Heap memory
>> 16
>> >>> G, Max direct memory 32 G. One query selects from Parquet. Another one
>> >>> selects fron JDBC. The parquet file size is 849 MB. It is UNION ALL.
>> There
>> >>> is not sorting.
>> >>>
>> >>> Single parquet query-
>> >>> Total execution time - 6.6 sec
>> >>> Scan time - 0.152 sec
>> >>> Screen wait time - 5.3 sec
>> >>>
>> >>> Single JDBC query-
>> >>> Total execution time - 0.261 sec
>> >>> JDBC scan - 0.152 sec
>> >>> Screen wait - 0.004 sec
>> >>>
>> >>>
>> >>> Union all query -
>> >>> Execution time - 21. 118 sec
>> >>> Screen wait time - 5.351 sec
>> >>> Parquet scan - 15.368 sec
>> >>> Unordered receiver wait time - 14.41 sec
>> >>>
>> >>> Thanks,
>> >>> Sreeparna Bhabani
>> >>>
>> >>>
>> >>> On Thu, 23 Apr 2020, 10:43 Paul Rogers, <pa...@yahoo.com> wrote:
>> >>>
>> >>> Hi Sreeparna,
>> >>>
>> >>>
>> >>> The short answer is it *should* work: a UNION ALL is simply an append.
>> >>> (Be sure you are not using a plain UNION as that needs to do more
>> work to
>> >>> remove duplicates.)
>> >>>
>> >>>
>> >>> Since you are seeing unexpected behavior, we may have some kind of
>> issue
>> >>> to investigate and perhaps fix. Always hard to do over e-mail, but
>> let's
>> >>> see what we can do.
>> >>>
>> >>>
>> >>> The first question is to understand the full query: are you doing more
>> >>> than a simple scan of two files and a UNION ALL? Are there sorts or
>> joins
>> >>> involved?
>> >>>
>> >>>
>> >>> The best place to start to investigate performance issues is the query
>> >>> profile, which it looks like you are doing. What is the time for the
>> scans
>> >>> if you run each of the two scans separately? You said that they take
>> 8 and
>> >>> 1 seconds. Is that for the whole query or just the scan operators?
>> >>>
>> >>>
>> >>> Then, when you run the UNION ALL, again looking at the scan operators,
>> >>> is there any difference in run times? If the scans take longer, that
>> is one
>> >>> thing to investigate. If the scans take the same amount of time, what
>> other
>> >>> operator(s) are taking the rest of the time? Your note suggests that
>> it is
>> >>> the scan taking the time. But, there should be two scan operators:
>> one for
>> >>> each file. How is the time divided between them?
>> >>>
>> >>>
>> >>> How large are the data files? Using what storage system? How many
>> >>> Drillbits? How much memory?
>> >>>
>> >>>
>> >>> Thanks,
>> >>>
>> >>> - Paul
>> >>>
>> >>>
>> >>>
>> >>> On Wednesday, April 22, 2020, 11:32:24 AM PDT, sreeparna bhabani <
>> >>> bhabani.sreeparna@gmail.com> wrote:
>> >>>
>> >>>
>> >>> Hi Team,
>> >>>
>> >>> I reach out to you for a specific problem regarding UNION ALL. There
>> is
>> >>> one
>> >>> UNION ALL statement which combines 2 queries. The individual queries
>> are
>> >>> taking 8 secs and 1 sec respectively. But UNION ALL takes 30 secs.
>> >>> PARQUET_SCAN_ROW_GROUP takes the maximum time. Apache drill version is
>> >>> 1.17.
>> >>>
>> >>> Please help to suggest how to improve this UNION ALL performance. We
>> are
>> >>> using parquet file.
>> >>>
>> >>> Thanks,
>> >>> Sreeparna Bhabani
>> >>>
>> >>>
>> >>
>> >> --
>> >>
>> >> Thanks n Regards,
>>
>> >> *Sreeparna Bhabani*
>> >>
>> >
>>
>> --
>>
>> Thanks n Regards,
>> *Sreeparna Bhabani*
>>
>
>
> --
>
> Thanks n Regards,
> *Sreeparna Bhabani*
>

Re: Suggestion needed for UNION ALL performance in Apache drill

Posted by sreeparna bhabani <bh...@gmail.com>.
Hi Paul,

Thanks for the analysis you did. The issue occurs when we UNION ALL between
Parquet and database. Here find below more details on the scenario-

The size of the folder-
849 MB
The total number of parquet files under this folder - 19
Total number of rows in this folder - 25245 K
Selected rows from Parquet after filter- 169 K
Selected rows from database - 50

I have tested the select query from database with same number of rows
(16402 K). But it is not parallelized for database query. In fact I tried
DB UNION ALL DB also. But it is not parallelized ( each of the DB query has
109 K rows).

I tried "Parquet JOIN database" for testing purpose to check whether it is
parallelized during joining. Although the JDBC_SCAN was not parallelized,
PARQUET_ROW_GROUP_SCAN parallelized here.

FYI - The planner.slice_target = 100000

I have raised one JIRA ticket. Please find the same -
<https://issues.apache.org/jira/browse/DRILL-7720>
https://issues.apache.org/jira/browse/DRILL-7720

The query we used-
SELECT column1, column2,...column25
FROM    dfs.`root`.`Parquet`
WHERE column1 = <> AND column2 = <> AND column3 = <>
UNION ALL
SELECT  column1, column2,...column25
FROM    db.`Database`
WHERE column1 = <> AND column2 = <> AND column3 = <>

Below is the Plan after redacted names-
00-00 Screen
00-01 Project(Column1=[$0],...,Column25=[$24])
00-02 UnionExchange
01-01 UnionAll(all=[true])
01-03 Project (Column1=[$0],...,Column25=[$24])
01-05 SelectionVectorRemover
01-06
Filter(condition=[AND(=$0,'<value>',),=($1,'<value>',=($2,'<value>'))])
01-07 Scan(table=[[]], groupscan=[ParquetGroupScan], selectionroot=<>,
numFiles=13, numRowGroups=69, usedMetadatafile=true, usedMetastore=false,
filter=booleanAnd(..), cacheFileRoot=.., columns=[.....] )
01-02  Project(Column1=[$0],...,Column25=[$24])
01-04 Jdbc()

json-
{
"head" : {
"version":1,
"generator" : {"type":"ExplainHandler","info":""},
"type" : "APACHE_DRILL_PHYSICAL","options" :
[],"queue":0,"hasResourcePlan":false,"resultMode":"EXEC"
},
"graph":[
{"pop":"jdbc-scan"},
{"pop":"project","initialAllocation""
1000000,"maxAllocation":10000000000,"cost":{memoryCost}:4194304,"outputRowCount":1},
{"pop":"parquet-scan",
"cost":{"memoryCost":4194304,"outputRowCount":1.2287038E7} },
{"pop":"filter" ,"initialAllocation""
1000000,"maxAllocation":10000000000,"cost":{"memoryCost":4194304,"outputRowCount":41468.75}
},
{"pop":"selection-vector-remover" ,"initialAllocation""
1000000,"maxAllocation":10000000000,"cost":{memoryCost}:4194304,"outputRowCount":41468.75
}  ,
{"pop":"project" ,"initialAllocation""
1000000,"maxAllocation":10000000000,"cost":{"memoryCost":4194304,"outputRowCount":41468.75}
}
{"pop":"union-all" ,"initialAllocation""
1000000,"maxAllocation":10000000000,"cost":{"memoryCost":4194304,"outputRowCount":41468.75}
}    ,
{"pop":"union-exchange" ,"initialAllocation""
1000000,"maxAllocation":10000000000,"cost":{"memoryCost":4194304,"outputRowCount":41468.75
} }    ,
{"pop":"project" ,"initialAllocation""
1000000,"maxAllocation":10000000000,"cost":{"memoryCost":4194304,"outputRowCount":41468.75},"outputProj":true
},
{"pop":"screen" ,"initialAllocation""
1000000,"maxAllocation":10000000000,"cost":{"memoryCost":4194304,"outputRowCount":41468.75}
}
]
}

Please let me know if any other information you need.

Thanks,
Sreeparna Bhabani

On Wed, May 6, 2020 at 2:05 PM sreeparna bhabani <
bhabani.sreeparna@gmail.com> wrote:

>
> ---------- Forwarded message ---------
> From: Paul Rogers <pa...@yahoo.com>
> Date: Tue, 5 May 2020, 13:21
> Subject: Re: Suggestion needed for UNION ALL performance in Apache drill
> To: <us...@drill.apache.org>
> Cc: Sreeparna Bhabani <bh...@gmail.com>, <ar...@gmail.com>,
> <na...@gmail.com>
>
>
> Hi Sreeparna,
>
>
> Thanks much for digging into the details. SQL is pretty complex and things
> don't always work as we might expect.
>
>
> The first question is: which plan is correct: the (Parquet UNION ALL DB)
> plan or the (Parquet UNION ALL Parquet) plan? I tried poking around, and
> got no definitive answer on whether UNION ALL implies ordering. On the
> one hand, the parts of the standards that I could find didn't seem to imply
> ordering. A UNION (no ALL) can't imply ordering since it essentially does
> an anti-join which may be hash-partitioned. But, a StackOverflow post
> suggested that there is an implied ordering in the case of an ORDER BY on
> the sub-queries:
>
>
> (SELECT ... ORDER BY ...)
>
> UNION AL
>
> (SELECT ... ORDER BY ...)
>
>
> That is, if we can sort each sub-query, then doing so only makes sense if
> all results of the first are returned before any results of the second.
> (Have not checked if the above is valid in Drill. Even if it was, the
> planner should handle the above as a special case.)
>
>
> So, let's assume that the (Parquet UNION ALL Parquet) case is the correct
> behavior. Then, we can speculate that the planner is getting confused
> somehow in the mixed case. Each data source (file for Parquet, JDBC for the
> DB) parallelizes independently. Each decides it needs just one fragment.
> Somehow the planner must be saying, "well, if they both only want one
> fragment, let's run the whole query in a single fragment."
>
>
> Perhaps the decision is based on row count and the planner somehow thinks
> the row counts will be small for one or both of the queries. In fact, what
> happens if you do a query with (DB UNION ALL DB)? Drill's ability to
> estimate row counts is poor, especially from JDBC. Perhaps the planner is
> guessing the tables are small and parallelizing is unnecessary.
>
>
> My advice is to file a JIRA ticket with as much detail as you can provide.
> Certainly the information from your-email. Ideally, also the query (with
> names redacted if necessary.) Also, the JSON query plan (obtained by using
> EXPLAIN PLAN FOR), again with names redacted if necessary.
>
>
> With that info, we can dig a bit deeper to determine why the two cases
> come out differently.
>
>
> Thanks,
>
> - Paul
>
>
>
> On Monday, May 4, 2020, 9:39:12 AM PDT, sreeparna bhabani <
> bhabani.sreeparna@gmail.com> wrote:
>
>
> Hi Team,
>
> After further checking on this UNION ALL, I found that UNION ALL
> (between Parquet and database) behaves as expected with limited number of
> rows and columns. But for a larger Parquet file and higher number of
> selected rows and columns, the UNION ALL takes much higher time than sum of
> the same of individual Parquet and DB Query.
>
> As per the analysis, it looks like the source of this issue is-
> Although we are using distributed mode, the UNION ALL query is executed
> only on 1 NODE in case of Parquet UNION ALL DB. It is not distributed and
> parallelized in multiple nodes.
>
> Whereas, for individual query or UNION ALL between same type datasets
> (Parquet + Parquet) it is getting distributed in 2 NODES.
>
> Do you have any finding / idea on this ?
>
> Thanks,
> Sreeparna Bhabani
>
> On Tue, Apr 28, 2020 at 9:00 PM sreeparna bhabani <
> bhabani.sreeparna@gmail.com> wrote:
>
> > Hi Paul Team,
> >
> > Please check the observation mentioned in the  below Jira where we found
> > that UNION ALL query is not parallelized between multiple nodes when
> there
> > are 2 types dataset (Parquet and Database). But it is parallelized if we
> > query individual Parquet file.
> >
> > Is there any way to enforce parallel execution in multiple nodes ?
> >
> > Thanks,
> > Sreeparna Bhabani
> >
> >
> > On Tue, 28 Apr 2020, 20:46 sreeparna bhabani, <
> bhabani.sreeparna@gmail.com>
> > wrote:
> >
> >>
> >> Hi Paul and Team,
> >>
> >> As you suggested I have created a Jira ticket which is  -
> >> https://issues.apache.org/jira/browse/DRILL-7720.
> >> I have mentioned details in the Jira you asked. Please have a look. As
> >> the data is sensitive, I am trying to create dummy dataset. Will
> >> provide once it is ready.
> >>
> >> Thanks,
> >> Sreeparna Bhabani
> >>
> >> On Fri, Apr 24, 2020 at 11:28 AM sreeparna bhabani <
> >> bhabani.sreeparna@gmail.com> wrote:
> >>
> >>>
> >>> ---------- Forwarded message ---------
> >>> From: Paul Rogers <pa...@yahoo.com>
> >>> Date: Thu, 23 Apr 2020, 23:59
> >>> Subject: Re: Suggestion needed for UNION ALL performance in Apache
> drill
> >>> To: <us...@drill.apache.org>, sreeparna bhabani <
> >>> bhabani.sreeparna@gmail.com>
> >>> Cc: <ar...@gmail.com>, <na...@gmail.com>
> >>>
> >>>
> >>> Hi Sreeparna,
> >>>
> >>>
> >>> As suggested in the earlier e-mail, we would not expect to see
> different
> >>> performance in UNION ALL than in a simple scan. Clearly you've found
> some
> >>> kind of issue. The next step is to investigate that issue, which is a
> bit
> >>> hard to do over e-mail.
> >>>
> >>>
> >>> Please file a JIRA ticket to describe the issue and provide a
> >>> reproducible test case including query and data. If your data is
> sensitive,
> >>> please create a dummy data set, or use the provided TPC-H data set to
> >>> recreate the issue. We can then take a look to see what might be
> happening.
> >>>
> >>>
> >>> Thanks,
> >>>
> >>> - Paul
> >>>
> >>>
> >>>
> >>> On Thursday, April 23, 2020, 10:18:13 AM PDT, sreeparna bhabani <
> >>> bhabani.sreeparna@gmail.com> wrote:
> >>>
> >>>
> >>> Hi Team,
> >>>
> >>> In addition to the below mail I have another finding. Please consider
> >>> below scenarios. The first 2 scenarios are giving expected results in
> terms
> >>> of performance. But we are not getting expected performance for 3rd
> >>> scenario which is UNION ALL with 2 different types of datasets.
> >>>
> >>> *Scenario 1- Parquet UNION ALL Parquet*
> >>> Individual execution time of 1st query - 5 secs
> >>> Individual execution time of 2nd query - 5 secs
> >>> UNION ALL of both queries execution time - 10 secs
> >>>
> >>> *Scenario 2 - DB query UNION ALL DB* *query*
> >>> Individual execution time of 1st query - 5 secs
> >>> Individual execution time of 2nd query - 5 secs
> >>> UNION ALL of both queries execution time - 10 secs
> >>>
> >>> *Scenario 3 - Parquet UNION ALL DB query*
> >>> Individual execution time of 1st query - 5 secs
> >>> Individual execution time of 2nd query - 1 sec
> >>> UNION ALL execution time - 20 secs
> >>> Ideally the execution time should not be more than 6 secs.
> >>>
> >>> May I request you to check whether the UNION ALL performance of 3rd
> >>> scenario is expected with different dataset types.
> >>>
> >>> Please suggest if there is any specific way to bring down the execution
> >>> time of 3rd scenario.
> >>>
> >>> Thanks in advance.
> >>>
> >>> Sreeparna Bhabani
> >>>
> >>>
> >>>
> >>> On Thu, 23 Apr 2020, 12:18 sreeparna bhabani, <
> >>> bhabani.sreeparna@gmail.com> wrote:
> >>>
> >>> Hi Team,
> >>>
> >>> Apart from the below issue I have another question.
> >>>
> >>> Is there any relation between number of row groups and performance ?
> >>>
> >>> In the below query the number of files is 13 and numRowGroups is 69. Is
> >>> the UNION ALL takes more time if the number of rowgroup is high like
> that.
> >>>
> >>> Please note that the individual Parquet query takes 6 secs. But UNION
> >>> ALL takes 20 secs. Details are given in trail mail.
> >>>
> >>> Thanks,
> >>> Sreeparna Bhabani
> >>>
> >>> On Thu, 23 Apr 2020, 11:08 sreeparna bhabani, <di...@gmail.com>
> >>> wrote:
> >>>
> >>> Hi Paul,
> >>>
> >>> Please find the details below. We are using 2 drillbits. Heap memory 16
> >>> G, Max direct memory 32 G. One query selects from Parquet. Another one
> >>> selects fron JDBC. The parquet file size is 849 MB. It is UNION ALL.
> There
> >>> is not sorting.
> >>>
> >>> Single parquet query-
> >>> Total execution time - 6.6 sec
> >>> Scan time - 0.152 sec
> >>> Screen wait time - 5.3 sec
> >>>
> >>> Single JDBC query-
> >>> Total execution time - 0.261 sec
> >>> JDBC scan - 0.152 sec
> >>> Screen wait - 0.004 sec
> >>>
> >>>
> >>> Union all query -
> >>> Execution time - 21. 118 sec
> >>> Screen wait time - 5.351 sec
> >>> Parquet scan - 15.368 sec
> >>> Unordered receiver wait time - 14.41 sec
> >>>
> >>> Thanks,
> >>> Sreeparna Bhabani
> >>>
> >>>
> >>> On Thu, 23 Apr 2020, 10:43 Paul Rogers, <pa...@yahoo.com> wrote:
> >>>
> >>> Hi Sreeparna,
> >>>
> >>>
> >>> The short answer is it *should* work: a UNION ALL is simply an append.
> >>> (Be sure you are not using a plain UNION as that needs to do more work
> to
> >>> remove duplicates.)
> >>>
> >>>
> >>> Since you are seeing unexpected behavior, we may have some kind of
> issue
> >>> to investigate and perhaps fix. Always hard to do over e-mail, but
> let's
> >>> see what we can do.
> >>>
> >>>
> >>> The first question is to understand the full query: are you doing more
> >>> than a simple scan of two files and a UNION ALL? Are there sorts or
> joins
> >>> involved?
> >>>
> >>>
> >>> The best place to start to investigate performance issues is the query
> >>> profile, which it looks like you are doing. What is the time for the
> scans
> >>> if you run each of the two scans separately? You said that they take 8
> and
> >>> 1 seconds. Is that for the whole query or just the scan operators?
> >>>
> >>>
> >>> Then, when you run the UNION ALL, again looking at the scan operators,
> >>> is there any difference in run times? If the scans take longer, that
> is one
> >>> thing to investigate. If the scans take the same amount of time, what
> other
> >>> operator(s) are taking the rest of the time? Your note suggests that
> it is
> >>> the scan taking the time. But, there should be two scan operators: one
> for
> >>> each file. How is the time divided between them?
> >>>
> >>>
> >>> How large are the data files? Using what storage system? How many
> >>> Drillbits? How much memory?
> >>>
> >>>
> >>> Thanks,
> >>>
> >>> - Paul
> >>>
> >>>
> >>>
> >>> On Wednesday, April 22, 2020, 11:32:24 AM PDT, sreeparna bhabani <
> >>> bhabani.sreeparna@gmail.com> wrote:
> >>>
> >>>
> >>> Hi Team,
> >>>
> >>> I reach out to you for a specific problem regarding UNION ALL. There is
> >>> one
> >>> UNION ALL statement which combines 2 queries. The individual queries
> are
> >>> taking 8 secs and 1 sec respectively. But UNION ALL takes 30 secs.
> >>> PARQUET_SCAN_ROW_GROUP takes the maximum time. Apache drill version is
> >>> 1.17.
> >>>
> >>> Please help to suggest how to improve this UNION ALL performance. We
> are
> >>> using parquet file.
> >>>
> >>> Thanks,
> >>> Sreeparna Bhabani
> >>>
> >>>
> >>
> >> --
> >>
> >> Thanks n Regards,
>
> >> *Sreeparna Bhabani*
> >>
> >
>
> --
>
> Thanks n Regards,
> *Sreeparna Bhabani*
>


-- 

Thanks n Regards,
*Sreeparna Bhabani*

Re: Suggestion needed for UNION ALL performance in Apache drill

Posted by Paul Rogers <pa...@yahoo.com.INVALID>.
Hi Sreeparna,

Thanks much for digging into the details. SQL is pretty complex and things don't always work as we might expect.

The first question is: which plan is correct: the (Parquet UNION ALL DB) plan or the (Parquet UNION ALL Parquet) plan? I tried poking around, and got no definitive answer on whether UNION ALL implies ordering. On the one hand, the parts of the standards that I could find didn't seem to imply ordering. A UNION (no ALL) can't imply ordering since it essentially does an anti-join which may be hash-partitioned. But, a StackOverflow post suggested that there is an implied ordering in the case of an ORDER BY on the sub-queries:

(SELECT ... ORDER BY ...)
UNION AL
 (SELECT ... ORDER BY ...)

That is, if we can sort each sub-query, then doing so only makes sense if all results of the first are returned before any results of the second. (Have not checked if the above is valid in Drill. Even if it was, the planner should handle the above as a special case.)


So, let's assume that the (Parquet UNION ALL Parquet) case is the correct behavior. Then, we can speculate that the planner is getting confused somehow in the mixed case. Each data source (file for Parquet, JDBC for the DB) parallelizes independently. Each decides it needs just one fragment. Somehow the planner must be saying, "well, if they both only want one fragment, let's run the whole query in a single fragment."

Perhaps the decision is based on row count and the planner somehow thinks the row counts will be small for one or both of the queries. In fact, what happens if you do a query with (DB UNION ALL DB)? Drill's ability to estimate row counts is poor, especially from JDBC. Perhaps the planner is guessing the tables are small and parallelizing is unnecessary.


My advice is to file a JIRA ticket with as much detail as you can provide. Certainly the information from your-email. Ideally, also the query (with names redacted if necessary.) Also, the JSON query plan (obtained by using EXPLAIN PLAN FOR), again with names redacted if necessary.

With that info, we can dig a bit deeper to determine why the two cases come out differently.

Thanks,
- Paul

 

    On Monday, May 4, 2020, 9:39:12 AM PDT, sreeparna bhabani <bh...@gmail.com> wrote:  
 
 Hi Team,

After further checking on this UNION ALL, I found that UNION ALL
(between Parquet and database) behaves as expected with limited number of
rows and columns. But for a larger Parquet file and higher number of
selected rows and columns, the UNION ALL takes much higher time than sum of
the same of individual Parquet and DB Query.

As per the analysis, it looks like the source of this issue is-
Although we are using distributed mode, the UNION ALL query is executed
only on 1 NODE in case of Parquet UNION ALL DB. It is not distributed and
parallelized in multiple nodes.

Whereas, for individual query or UNION ALL between same type datasets
(Parquet + Parquet) it is getting distributed in 2 NODES.

Do you have any finding / idea on this ?

Thanks,
Sreeparna Bhabani

On Tue, Apr 28, 2020 at 9:00 PM sreeparna bhabani <
bhabani.sreeparna@gmail.com> wrote:

> Hi Paul Team,
>
> Please check the observation mentioned in the  below Jira where we found
> that UNION ALL query is not parallelized between multiple nodes when there
> are 2 types dataset (Parquet and Database). But it is parallelized if we
> query individual Parquet file.
>
> Is there any way to enforce parallel execution in multiple nodes ?
>
> Thanks,
> Sreeparna Bhabani
>
>
> On Tue, 28 Apr 2020, 20:46 sreeparna bhabani, <bh...@gmail.com>
> wrote:
>
>>
>> Hi Paul and Team,
>>
>> As you suggested I have created a Jira ticket which is  -
>> https://issues.apache.org/jira/browse/DRILL-7720.
>> I have mentioned details in the Jira you asked. Please have a look. As
>> the data is sensitive, I am trying to create dummy dataset. Will
>> provide once it is ready.
>>
>> Thanks,
>> Sreeparna Bhabani
>>
>> On Fri, Apr 24, 2020 at 11:28 AM sreeparna bhabani <
>> bhabani.sreeparna@gmail.com> wrote:
>>
>>>
>>> ---------- Forwarded message ---------
>>> From: Paul Rogers <pa...@yahoo.com>
>>> Date: Thu, 23 Apr 2020, 23:59
>>> Subject: Re: Suggestion needed for UNION ALL performance in Apache drill
>>> To: <us...@drill.apache.org>, sreeparna bhabani <
>>> bhabani.sreeparna@gmail.com>
>>> Cc: <ar...@gmail.com>, <na...@gmail.com>
>>>
>>>
>>> Hi Sreeparna,
>>>
>>>
>>> As suggested in the earlier e-mail, we would not expect to see different
>>> performance in UNION ALL than in a simple scan. Clearly you've found some
>>> kind of issue. The next step is to investigate that issue, which is a bit
>>> hard to do over e-mail.
>>>
>>>
>>> Please file a JIRA ticket to describe the issue and provide a
>>> reproducible test case including query and data. If your data is sensitive,
>>> please create a dummy data set, or use the provided TPC-H data set to
>>> recreate the issue. We can then take a look to see what might be happening.
>>>
>>>
>>> Thanks,
>>>
>>> - Paul
>>>
>>>
>>>
>>> On Thursday, April 23, 2020, 10:18:13 AM PDT, sreeparna bhabani <
>>> bhabani.sreeparna@gmail.com> wrote:
>>>
>>>
>>> Hi Team,
>>>
>>> In addition to the below mail I have another finding. Please consider
>>> below scenarios. The first 2 scenarios are giving expected results in terms
>>> of performance. But we are not getting expected performance for 3rd
>>> scenario which is UNION ALL with 2 different types of datasets.
>>>
>>> *Scenario 1- Parquet UNION ALL Parquet*
>>> Individual execution time of 1st query - 5 secs
>>> Individual execution time of 2nd query - 5 secs
>>> UNION ALL of both queries execution time - 10 secs
>>>
>>> *Scenario 2 - DB query UNION ALL DB* *query*
>>> Individual execution time of 1st query - 5 secs
>>> Individual execution time of 2nd query - 5 secs
>>> UNION ALL of both queries execution time - 10 secs
>>>
>>> *Scenario 3 - Parquet UNION ALL DB query*
>>> Individual execution time of 1st query - 5 secs
>>> Individual execution time of 2nd query - 1 sec
>>> UNION ALL execution time - 20 secs
>>> Ideally the execution time should not be more than 6 secs.
>>>
>>> May I request you to check whether the UNION ALL performance of 3rd
>>> scenario is expected with different dataset types.
>>>
>>> Please suggest if there is any specific way to bring down the execution
>>> time of 3rd scenario.
>>>
>>> Thanks in advance.
>>>
>>> Sreeparna Bhabani
>>>
>>>
>>>
>>> On Thu, 23 Apr 2020, 12:18 sreeparna bhabani, <
>>> bhabani.sreeparna@gmail.com> wrote:
>>>
>>> Hi Team,
>>>
>>> Apart from the below issue I have another question.
>>>
>>> Is there any relation between number of row groups and performance ?
>>>
>>> In the below query the number of files is 13 and numRowGroups is 69. Is
>>> the UNION ALL takes more time if the number of rowgroup is high like that.
>>>
>>> Please note that the individual Parquet query takes 6 secs. But UNION
>>> ALL takes 20 secs. Details are given in trail mail.
>>>
>>> Thanks,
>>> Sreeparna Bhabani
>>>
>>> On Thu, 23 Apr 2020, 11:08 sreeparna bhabani, <di...@gmail.com>
>>> wrote:
>>>
>>> Hi Paul,
>>>
>>> Please find the details below. We are using 2 drillbits. Heap memory 16
>>> G, Max direct memory 32 G. One query selects from Parquet. Another one
>>> selects fron JDBC. The parquet file size is 849 MB. It is UNION ALL. There
>>> is not sorting.
>>>
>>> Single parquet query-
>>> Total execution time - 6.6 sec
>>> Scan time - 0.152 sec
>>> Screen wait time - 5.3 sec
>>>
>>> Single JDBC query-
>>> Total execution time - 0.261 sec
>>> JDBC scan - 0.152 sec
>>> Screen wait - 0.004 sec
>>>
>>>
>>> Union all query -
>>> Execution time - 21. 118 sec
>>> Screen wait time - 5.351 sec
>>> Parquet scan - 15.368 sec
>>> Unordered receiver wait time - 14.41 sec
>>>
>>> Thanks,
>>> Sreeparna Bhabani
>>>
>>>
>>> On Thu, 23 Apr 2020, 10:43 Paul Rogers, <pa...@yahoo.com> wrote:
>>>
>>> Hi Sreeparna,
>>>
>>>
>>> The short answer is it *should* work: a UNION ALL is simply an append.
>>> (Be sure you are not using a plain UNION as that needs to do more work to
>>> remove duplicates.)
>>>
>>>
>>> Since you are seeing unexpected behavior, we may have some kind of issue
>>> to investigate and perhaps fix. Always hard to do over e-mail, but let's
>>> see what we can do.
>>>
>>>
>>> The first question is to understand the full query: are you doing more
>>> than a simple scan of two files and a UNION ALL? Are there sorts or joins
>>> involved?
>>>
>>>
>>> The best place to start to investigate performance issues is the query
>>> profile, which it looks like you are doing. What is the time for the scans
>>> if you run each of the two scans separately? You said that they take 8 and
>>> 1 seconds. Is that for the whole query or just the scan operators?
>>>
>>>
>>> Then, when you run the UNION ALL, again looking at the scan operators,
>>> is there any difference in run times? If the scans take longer, that is one
>>> thing to investigate. If the scans take the same amount of time, what other
>>> operator(s) are taking the rest of the time? Your note suggests that it is
>>> the scan taking the time. But, there should be two scan operators: one for
>>> each file. How is the time divided between them?
>>>
>>>
>>> How large are the data files? Using what storage system? How many
>>> Drillbits? How much memory?
>>>
>>>
>>> Thanks,
>>>
>>> - Paul
>>>
>>>
>>>
>>> On Wednesday, April 22, 2020, 11:32:24 AM PDT, sreeparna bhabani <
>>> bhabani.sreeparna@gmail.com> wrote:
>>>
>>>
>>> Hi Team,
>>>
>>> I reach out to you for a specific problem regarding UNION ALL. There is
>>> one
>>> UNION ALL statement which combines 2 queries. The individual queries are
>>> taking 8 secs and 1 sec respectively. But UNION ALL takes 30 secs.
>>> PARQUET_SCAN_ROW_GROUP takes the maximum time. Apache drill version is
>>> 1.17.
>>>
>>> Please help to suggest how to improve this UNION ALL performance. We are
>>> using parquet file.
>>>
>>> Thanks,
>>> Sreeparna Bhabani
>>>
>>>
>>
>> --
>>
>> Thanks n Regards,
>> *Sreeparna Bhabani*
>>
>

-- 

Thanks n Regards,
*Sreeparna Bhabani*
  

Re: Suggestion needed for UNION ALL performance in Apache drill

Posted by sreeparna bhabani <bh...@gmail.com>.
Hi Team,

After further checking on this UNION ALL, I found that UNION ALL
(between Parquet and database) behaves as expected with limited number of
rows and columns. But for a larger Parquet file and higher number of
selected rows and columns, the UNION ALL takes much higher time than sum of
the same of individual Parquet and DB Query.

As per the analysis, it looks like the source of this issue is-
Although we are using distributed mode, the UNION ALL query is executed
only on 1 NODE in case of Parquet UNION ALL DB. It is not distributed and
parallelized in multiple nodes.

Whereas, for individual query or UNION ALL between same type datasets
(Parquet + Parquet) it is getting distributed in 2 NODES.

Do you have any finding / idea on this ?

Thanks,
Sreeparna Bhabani

On Tue, Apr 28, 2020 at 9:00 PM sreeparna bhabani <
bhabani.sreeparna@gmail.com> wrote:

> Hi Paul Team,
>
> Please check the observation mentioned in the  below Jira where we found
> that UNION ALL query is not parallelized between multiple nodes when there
> are 2 types dataset (Parquet and Database). But it is parallelized if we
> query individual Parquet file.
>
> Is there any way to enforce parallel execution in multiple nodes ?
>
> Thanks,
> Sreeparna Bhabani
>
>
> On Tue, 28 Apr 2020, 20:46 sreeparna bhabani, <bh...@gmail.com>
> wrote:
>
>>
>> Hi Paul and Team,
>>
>> As you suggested I have created a Jira ticket which is  -
>> https://issues.apache.org/jira/browse/DRILL-7720.
>> I have mentioned details in the Jira you asked. Please have a look. As
>> the data is sensitive, I am trying to create dummy dataset. Will
>> provide once it is ready.
>>
>> Thanks,
>> Sreeparna Bhabani
>>
>> On Fri, Apr 24, 2020 at 11:28 AM sreeparna bhabani <
>> bhabani.sreeparna@gmail.com> wrote:
>>
>>>
>>> ---------- Forwarded message ---------
>>> From: Paul Rogers <pa...@yahoo.com>
>>> Date: Thu, 23 Apr 2020, 23:59
>>> Subject: Re: Suggestion needed for UNION ALL performance in Apache drill
>>> To: <us...@drill.apache.org>, sreeparna bhabani <
>>> bhabani.sreeparna@gmail.com>
>>> Cc: <ar...@gmail.com>, <na...@gmail.com>
>>>
>>>
>>> Hi Sreeparna,
>>>
>>>
>>> As suggested in the earlier e-mail, we would not expect to see different
>>> performance in UNION ALL than in a simple scan. Clearly you've found some
>>> kind of issue. The next step is to investigate that issue, which is a bit
>>> hard to do over e-mail.
>>>
>>>
>>> Please file a JIRA ticket to describe the issue and provide a
>>> reproducible test case including query and data. If your data is sensitive,
>>> please create a dummy data set, or use the provided TPC-H data set to
>>> recreate the issue. We can then take a look to see what might be happening.
>>>
>>>
>>> Thanks,
>>>
>>> - Paul
>>>
>>>
>>>
>>> On Thursday, April 23, 2020, 10:18:13 AM PDT, sreeparna bhabani <
>>> bhabani.sreeparna@gmail.com> wrote:
>>>
>>>
>>> Hi Team,
>>>
>>> In addition to the below mail I have another finding. Please consider
>>> below scenarios. The first 2 scenarios are giving expected results in terms
>>> of performance. But we are not getting expected performance for 3rd
>>> scenario which is UNION ALL with 2 different types of datasets.
>>>
>>> *Scenario 1- Parquet UNION ALL Parquet*
>>> Individual execution time of 1st query - 5 secs
>>> Individual execution time of 2nd query - 5 secs
>>> UNION ALL of both queries execution time - 10 secs
>>>
>>> *Scenario 2 - DB query UNION ALL DB* *query*
>>> Individual execution time of 1st query - 5 secs
>>> Individual execution time of 2nd query - 5 secs
>>> UNION ALL of both queries execution time - 10 secs
>>>
>>> *Scenario 3 - Parquet UNION ALL DB query*
>>> Individual execution time of 1st query - 5 secs
>>> Individual execution time of 2nd query - 1 sec
>>> UNION ALL execution time - 20 secs
>>> Ideally the execution time should not be more than 6 secs.
>>>
>>> May I request you to check whether the UNION ALL performance of 3rd
>>> scenario is expected with different dataset types.
>>>
>>> Please suggest if there is any specific way to bring down the execution
>>> time of 3rd scenario.
>>>
>>> Thanks in advance.
>>>
>>> Sreeparna Bhabani
>>>
>>>
>>>
>>> On Thu, 23 Apr 2020, 12:18 sreeparna bhabani, <
>>> bhabani.sreeparna@gmail.com> wrote:
>>>
>>> Hi Team,
>>>
>>> Apart from the below issue I have another question.
>>>
>>> Is there any relation between number of row groups and performance ?
>>>
>>> In the below query the number of files is 13 and numRowGroups is 69. Is
>>> the UNION ALL takes more time if the number of rowgroup is high like that.
>>>
>>> Please note that the individual Parquet query takes 6 secs. But UNION
>>> ALL takes 20 secs. Details are given in trail mail.
>>>
>>> Thanks,
>>> Sreeparna Bhabani
>>>
>>> On Thu, 23 Apr 2020, 11:08 sreeparna bhabani, <di...@gmail.com>
>>> wrote:
>>>
>>> Hi Paul,
>>>
>>> Please find the details below. We are using 2 drillbits. Heap memory 16
>>> G, Max direct memory 32 G. One query selects from Parquet. Another one
>>> selects fron JDBC. The parquet file size is 849 MB. It is UNION ALL. There
>>> is not sorting.
>>>
>>> Single parquet query-
>>> Total execution time - 6.6 sec
>>> Scan time - 0.152 sec
>>> Screen wait time - 5.3 sec
>>>
>>> Single JDBC query-
>>> Total execution time - 0.261 sec
>>> JDBC scan - 0.152 sec
>>> Screen wait - 0.004 sec
>>>
>>>
>>> Union all query -
>>> Execution time - 21. 118 sec
>>> Screen wait time - 5.351 sec
>>> Parquet scan - 15.368 sec
>>> Unordered receiver wait time - 14.41 sec
>>>
>>> Thanks,
>>> Sreeparna Bhabani
>>>
>>>
>>> On Thu, 23 Apr 2020, 10:43 Paul Rogers, <pa...@yahoo.com> wrote:
>>>
>>> Hi Sreeparna,
>>>
>>>
>>> The short answer is it *should* work: a UNION ALL is simply an append.
>>> (Be sure you are not using a plain UNION as that needs to do more work to
>>> remove duplicates.)
>>>
>>>
>>> Since you are seeing unexpected behavior, we may have some kind of issue
>>> to investigate and perhaps fix. Always hard to do over e-mail, but let's
>>> see what we can do.
>>>
>>>
>>> The first question is to understand the full query: are you doing more
>>> than a simple scan of two files and a UNION ALL? Are there sorts or joins
>>> involved?
>>>
>>>
>>> The best place to start to investigate performance issues is the query
>>> profile, which it looks like you are doing. What is the time for the scans
>>> if you run each of the two scans separately? You said that they take 8 and
>>> 1 seconds. Is that for the whole query or just the scan operators?
>>>
>>>
>>> Then, when you run the UNION ALL, again looking at the scan operators,
>>> is there any difference in run times? If the scans take longer, that is one
>>> thing to investigate. If the scans take the same amount of time, what other
>>> operator(s) are taking the rest of the time? Your note suggests that it is
>>> the scan taking the time. But, there should be two scan operators: one for
>>> each file. How is the time divided between them?
>>>
>>>
>>> How large are the data files? Using what storage system? How many
>>> Drillbits? How much memory?
>>>
>>>
>>> Thanks,
>>>
>>> - Paul
>>>
>>>
>>>
>>> On Wednesday, April 22, 2020, 11:32:24 AM PDT, sreeparna bhabani <
>>> bhabani.sreeparna@gmail.com> wrote:
>>>
>>>
>>> Hi Team,
>>>
>>> I reach out to you for a specific problem regarding UNION ALL. There is
>>> one
>>> UNION ALL statement which combines 2 queries. The individual queries are
>>> taking 8 secs and 1 sec respectively. But UNION ALL takes 30 secs.
>>> PARQUET_SCAN_ROW_GROUP takes the maximum time. Apache drill version is
>>> 1.17.
>>>
>>> Please help to suggest how to improve this UNION ALL performance. We are
>>> using parquet file.
>>>
>>> Thanks,
>>> Sreeparna Bhabani
>>>
>>>
>>
>> --
>>
>> Thanks n Regards,
>> *Sreeparna Bhabani*
>>
>

-- 

Thanks n Regards,
*Sreeparna Bhabani*

Re: Suggestion needed for UNION ALL performance in Apache drill

Posted by sreeparna bhabani <bh...@gmail.com>.
Hi Paul Team,

Please check the observation mentioned in the  below Jira where we found
that UNION ALL query is not parallelized between multiple nodes when there
are 2 types dataset (Parquet and Database). But it is parallelized if we
query individual Parquet file.

Is there any way to enforce parallel execution in multiple nodes ?

Thanks,
Sreeparna Bhabani


On Tue, 28 Apr 2020, 20:46 sreeparna bhabani, <bh...@gmail.com>
wrote:

>
> Hi Paul and Team,
>
> As you suggested I have created a Jira ticket which is  -
> https://issues.apache.org/jira/browse/DRILL-7720.
> I have mentioned details in the Jira you asked. Please have a look. As the
> data is sensitive, I am trying to create dummy dataset. Will provide once
> it is ready.
>
> Thanks,
> Sreeparna Bhabani
>
> On Fri, Apr 24, 2020 at 11:28 AM sreeparna bhabani <
> bhabani.sreeparna@gmail.com> wrote:
>
>>
>> ---------- Forwarded message ---------
>> From: Paul Rogers <pa...@yahoo.com>
>> Date: Thu, 23 Apr 2020, 23:59
>> Subject: Re: Suggestion needed for UNION ALL performance in Apache drill
>> To: <us...@drill.apache.org>, sreeparna bhabani <
>> bhabani.sreeparna@gmail.com>
>> Cc: <ar...@gmail.com>, <na...@gmail.com>
>>
>>
>> Hi Sreeparna,
>>
>>
>> As suggested in the earlier e-mail, we would not expect to see different
>> performance in UNION ALL than in a simple scan. Clearly you've found some
>> kind of issue. The next step is to investigate that issue, which is a bit
>> hard to do over e-mail.
>>
>>
>> Please file a JIRA ticket to describe the issue and provide a
>> reproducible test case including query and data. If your data is sensitive,
>> please create a dummy data set, or use the provided TPC-H data set to
>> recreate the issue. We can then take a look to see what might be happening.
>>
>>
>> Thanks,
>>
>> - Paul
>>
>>
>>
>> On Thursday, April 23, 2020, 10:18:13 AM PDT, sreeparna bhabani <
>> bhabani.sreeparna@gmail.com> wrote:
>>
>>
>> Hi Team,
>>
>> In addition to the below mail I have another finding. Please consider
>> below scenarios. The first 2 scenarios are giving expected results in terms
>> of performance. But we are not getting expected performance for 3rd
>> scenario which is UNION ALL with 2 different types of datasets.
>>
>> *Scenario 1- Parquet UNION ALL Parquet*
>> Individual execution time of 1st query - 5 secs
>> Individual execution time of 2nd query - 5 secs
>> UNION ALL of both queries execution time - 10 secs
>>
>> *Scenario 2 - DB query UNION ALL DB* *query*
>> Individual execution time of 1st query - 5 secs
>> Individual execution time of 2nd query - 5 secs
>> UNION ALL of both queries execution time - 10 secs
>>
>> *Scenario 3 - Parquet UNION ALL DB query*
>> Individual execution time of 1st query - 5 secs
>> Individual execution time of 2nd query - 1 sec
>> UNION ALL execution time - 20 secs
>> Ideally the execution time should not be more than 6 secs.
>>
>> May I request you to check whether the UNION ALL performance of 3rd
>> scenario is expected with different dataset types.
>>
>> Please suggest if there is any specific way to bring down the execution
>> time of 3rd scenario.
>>
>> Thanks in advance.
>>
>> Sreeparna Bhabani
>>
>>
>>
>> On Thu, 23 Apr 2020, 12:18 sreeparna bhabani, <
>> bhabani.sreeparna@gmail.com> wrote:
>>
>> Hi Team,
>>
>> Apart from the below issue I have another question.
>>
>> Is there any relation between number of row groups and performance ?
>>
>> In the below query the number of files is 13 and numRowGroups is 69. Is
>> the UNION ALL takes more time if the number of rowgroup is high like that.
>>
>> Please note that the individual Parquet query takes 6 secs. But UNION ALL
>> takes 20 secs. Details are given in trail mail.
>>
>> Thanks,
>> Sreeparna Bhabani
>>
>> On Thu, 23 Apr 2020, 11:08 sreeparna bhabani, <di...@gmail.com>
>> wrote:
>>
>> Hi Paul,
>>
>> Please find the details below. We are using 2 drillbits. Heap memory 16
>> G, Max direct memory 32 G. One query selects from Parquet. Another one
>> selects fron JDBC. The parquet file size is 849 MB. It is UNION ALL. There
>> is not sorting.
>>
>> Single parquet query-
>> Total execution time - 6.6 sec
>> Scan time - 0.152 sec
>> Screen wait time - 5.3 sec
>>
>> Single JDBC query-
>> Total execution time - 0.261 sec
>> JDBC scan - 0.152 sec
>> Screen wait - 0.004 sec
>>
>>
>> Union all query -
>> Execution time - 21. 118 sec
>> Screen wait time - 5.351 sec
>> Parquet scan - 15.368 sec
>> Unordered receiver wait time - 14.41 sec
>>
>> Thanks,
>> Sreeparna Bhabani
>>
>>
>> On Thu, 23 Apr 2020, 10:43 Paul Rogers, <pa...@yahoo.com> wrote:
>>
>> Hi Sreeparna,
>>
>>
>> The short answer is it *should* work: a UNION ALL is simply an append.
>> (Be sure you are not using a plain UNION as that needs to do more work to
>> remove duplicates.)
>>
>>
>> Since you are seeing unexpected behavior, we may have some kind of issue
>> to investigate and perhaps fix. Always hard to do over e-mail, but let's
>> see what we can do.
>>
>>
>> The first question is to understand the full query: are you doing more
>> than a simple scan of two files and a UNION ALL? Are there sorts or joins
>> involved?
>>
>>
>> The best place to start to investigate performance issues is the query
>> profile, which it looks like you are doing. What is the time for the scans
>> if you run each of the two scans separately? You said that they take 8 and
>> 1 seconds. Is that for the whole query or just the scan operators?
>>
>>
>> Then, when you run the UNION ALL, again looking at the scan operators, is
>> there any difference in run times? If the scans take longer, that is one
>> thing to investigate. If the scans take the same amount of time, what other
>> operator(s) are taking the rest of the time? Your note suggests that it is
>> the scan taking the time. But, there should be two scan operators: one for
>> each file. How is the time divided between them?
>>
>>
>> How large are the data files? Using what storage system? How many
>> Drillbits? How much memory?
>>
>>
>> Thanks,
>>
>> - Paul
>>
>>
>>
>> On Wednesday, April 22, 2020, 11:32:24 AM PDT, sreeparna bhabani <
>> bhabani.sreeparna@gmail.com> wrote:
>>
>>
>> Hi Team,
>>
>> I reach out to you for a specific problem regarding UNION ALL. There is
>> one
>> UNION ALL statement which combines 2 queries. The individual queries are
>> taking 8 secs and 1 sec respectively. But UNION ALL takes 30 secs.
>> PARQUET_SCAN_ROW_GROUP takes the maximum time. Apache drill version is
>> 1.17.
>>
>> Please help to suggest how to improve this UNION ALL performance. We are
>> using parquet file.
>>
>> Thanks,
>> Sreeparna Bhabani
>>
>>
>
> --
>
> Thanks n Regards,
> *Sreeparna Bhabani*
>

Re: Suggestion needed for UNION ALL performance in Apache drill

Posted by sreeparna bhabani <bh...@gmail.com>.
Hi Paul and Team,

As you suggested I have created a Jira ticket which is  -
https://issues.apache.org/jira/browse/DRILL-7720.
I have mentioned details in the Jira you asked. Please have a look. As the
data is sensitive, I am trying to create dummy dataset. Will provide once
it is ready.

Thanks,
Sreeparna Bhabani

On Fri, Apr 24, 2020 at 11:28 AM sreeparna bhabani <
bhabani.sreeparna@gmail.com> wrote:

>
> ---------- Forwarded message ---------
> From: Paul Rogers <pa...@yahoo.com>
> Date: Thu, 23 Apr 2020, 23:59
> Subject: Re: Suggestion needed for UNION ALL performance in Apache drill
> To: <us...@drill.apache.org>, sreeparna bhabani <
> bhabani.sreeparna@gmail.com>
> Cc: <ar...@gmail.com>, <na...@gmail.com>
>
>
> Hi Sreeparna,
>
>
> As suggested in the earlier e-mail, we would not expect to see different
> performance in UNION ALL than in a simple scan. Clearly you've found some
> kind of issue. The next step is to investigate that issue, which is a bit
> hard to do over e-mail.
>
>
> Please file a JIRA ticket to describe the issue and provide a reproducible
> test case including query and data. If your data is sensitive, please
> create a dummy data set, or use the provided TPC-H data set to recreate the
> issue. We can then take a look to see what might be happening.
>
>
> Thanks,
>
> - Paul
>
>
>
> On Thursday, April 23, 2020, 10:18:13 AM PDT, sreeparna bhabani <
> bhabani.sreeparna@gmail.com> wrote:
>
>
> Hi Team,
>
> In addition to the below mail I have another finding. Please consider
> below scenarios. The first 2 scenarios are giving expected results in terms
> of performance. But we are not getting expected performance for 3rd
> scenario which is UNION ALL with 2 different types of datasets.
>
> *Scenario 1- Parquet UNION ALL Parquet*
> Individual execution time of 1st query - 5 secs
> Individual execution time of 2nd query - 5 secs
> UNION ALL of both queries execution time - 10 secs
>
> *Scenario 2 - DB query UNION ALL DB* *query*
> Individual execution time of 1st query - 5 secs
> Individual execution time of 2nd query - 5 secs
> UNION ALL of both queries execution time - 10 secs
>
> *Scenario 3 - Parquet UNION ALL DB query*
> Individual execution time of 1st query - 5 secs
> Individual execution time of 2nd query - 1 sec
> UNION ALL execution time - 20 secs
> Ideally the execution time should not be more than 6 secs.
>
> May I request you to check whether the UNION ALL performance of 3rd
> scenario is expected with different dataset types.
>
> Please suggest if there is any specific way to bring down the execution
> time of 3rd scenario.
>
> Thanks in advance.
>
> Sreeparna Bhabani
>
>
>
> On Thu, 23 Apr 2020, 12:18 sreeparna bhabani, <bh...@gmail.com>
> wrote:
>
> Hi Team,
>
> Apart from the below issue I have another question.
>
> Is there any relation between number of row groups and performance ?
>
> In the below query the number of files is 13 and numRowGroups is 69. Is
> the UNION ALL takes more time if the number of rowgroup is high like that.
>
> Please note that the individual Parquet query takes 6 secs. But UNION ALL
> takes 20 secs. Details are given in trail mail.
>
> Thanks,
> Sreeparna Bhabani
>
> On Thu, 23 Apr 2020, 11:08 sreeparna bhabani, <di...@gmail.com>
> wrote:
>
> Hi Paul,
>
> Please find the details below. We are using 2 drillbits. Heap memory 16 G,
> Max direct memory 32 G. One query selects from Parquet. Another one selects
> fron JDBC. The parquet file size is 849 MB. It is UNION ALL. There is not
> sorting.
>
> Single parquet query-
> Total execution time - 6.6 sec
> Scan time - 0.152 sec
> Screen wait time - 5.3 sec
>
> Single JDBC query-
> Total execution time - 0.261 sec
> JDBC scan - 0.152 sec
> Screen wait - 0.004 sec
>
>
> Union all query -
> Execution time - 21. 118 sec
> Screen wait time - 5.351 sec
> Parquet scan - 15.368 sec
> Unordered receiver wait time - 14.41 sec
>
> Thanks,
> Sreeparna Bhabani
>
>
> On Thu, 23 Apr 2020, 10:43 Paul Rogers, <pa...@yahoo.com> wrote:
>
> Hi Sreeparna,
>
>
> The short answer is it *should* work: a UNION ALL is simply an append. (Be
> sure you are not using a plain UNION as that needs to do more work to
> remove duplicates.)
>
>
> Since you are seeing unexpected behavior, we may have some kind of issue
> to investigate and perhaps fix. Always hard to do over e-mail, but let's
> see what we can do.
>
>
> The first question is to understand the full query: are you doing more
> than a simple scan of two files and a UNION ALL? Are there sorts or joins
> involved?
>
>
> The best place to start to investigate performance issues is the query
> profile, which it looks like you are doing. What is the time for the scans
> if you run each of the two scans separately? You said that they take 8 and
> 1 seconds. Is that for the whole query or just the scan operators?
>
>
> Then, when you run the UNION ALL, again looking at the scan operators, is
> there any difference in run times? If the scans take longer, that is one
> thing to investigate. If the scans take the same amount of time, what other
> operator(s) are taking the rest of the time? Your note suggests that it is
> the scan taking the time. But, there should be two scan operators: one for
> each file. How is the time divided between them?
>
>
> How large are the data files? Using what storage system? How many
> Drillbits? How much memory?
>
>
> Thanks,
>
> - Paul
>
>
>
> On Wednesday, April 22, 2020, 11:32:24 AM PDT, sreeparna bhabani <
> bhabani.sreeparna@gmail.com> wrote:
>
>
> Hi Team,
>
> I reach out to you for a specific problem regarding UNION ALL. There is one
> UNION ALL statement which combines 2 queries. The individual queries are
> taking 8 secs and 1 sec respectively. But UNION ALL takes 30 secs.
> PARQUET_SCAN_ROW_GROUP takes the maximum time. Apache drill version is
> 1.17.
>
> Please help to suggest how to improve this UNION ALL performance. We are
> using parquet file.
>
> Thanks,
> Sreeparna Bhabani
>
>

-- 

Thanks n Regards,
*Sreeparna Bhabani*

Re: Suggestion needed for UNION ALL performance in Apache drill

Posted by Paul Rogers <pa...@yahoo.com.INVALID>.
Hi Sreeparna,

As suggested in the earlier e-mail, we would not expect to see different performance in UNION ALL than in a simple scan. Clearly you've found some kind of issue. The next step is to investigate that issue, which is a bit hard to do over e-mail.


Please file a JIRA ticket to describe the issue and provide a reproducible test case including query and data. If your data is sensitive, please create a dummy data set, or use the provided TPC-H data set to recreate the issue. We can then take a look to see what might be happening.

Thanks,
- Paul

 

    On Thursday, April 23, 2020, 10:18:13 AM PDT, sreeparna bhabani <bh...@gmail.com> wrote:  
 
 Hi Team,
In addition to the below mail I have another finding. Please consider below scenarios. The first 2 scenarios are giving expected results in terms of performance. But we are not getting expected performance for 3rd scenario which is UNION ALL with 2 different types of datasets.

Scenario 1- Parquet UNION ALL Parquet
Individual execution time of 1st query - 5 secsIndividual execution time of 2nd query - 5 secsUNION ALL of both queries execution time - 10 secs
Scenario 2 - DB query UNION ALL DB queryIndividual execution time of 1st query - 5 secsIndividual execution time of 2nd query - 5 secsUNION ALL of both queries execution time - 10 secs
Scenario 3 - Parquet UNION ALL DB query
Individual execution time of 1st query - 5 secsIndividual execution time of 2nd query - 1 secUNION ALL execution time - 20 secsIdeally the execution time should not be more than 6 secs.
May I request you to check whether the UNION ALL performance of 3rd scenario is expected with different dataset types.
Please suggest if there is any specific way to bring down the execution time of 3rd scenario.
Thanks in advance.
Sreeparna Bhabani


On Thu, 23 Apr 2020, 12:18 sreeparna bhabani, <bh...@gmail.com> wrote:

Hi Team,
Apart from the below issue I have another question.
Is there any relation between number of row groups and performance ?
In the below query the number of files is 13 and numRowGroups is 69. Is the UNION ALL takes more time if the number of rowgroup is high like that.
Please note that the individual Parquet query takes 6 secs. But UNION ALL takes 20 secs. Details are given in trail mail.
Thanks,Sreeparna Bhabani

On Thu, 23 Apr 2020, 11:08 sreeparna bhabani, <di...@gmail.com> wrote:

Hi Paul,
Please find the details below. We are using 2 drillbits. Heap memory 16 G, Max direct memory 32 G. One query selects from Parquet. Another one selects fron JDBC. The parquet file size is 849 MB. It is UNION ALL. There is not sorting.
Single parquet query-Total execution time - 6.6 secScan time - 0.152 secScreen wait time - 5.3 sec
Single JDBC query-Total execution time - 0.261 secJDBC scan - 0.152 secScreen wait - 0.004 sec

Union all query -Execution time - 21. 118 secScreen wait time - 5.351 secParquet scan - 15.368 secUnordered receiver wait time - 14.41 sec
Thanks,Sreeparna Bhabani

On Thu, 23 Apr 2020, 10:43 Paul Rogers, <pa...@yahoo.com> wrote:

Hi Sreeparna,

The short answer is it *should* work: a UNION ALL is simply an append. (Be sure you are not using a plain UNION as that needs to do more work to remove duplicates.)

Since you are seeing unexpected behavior, we may have some kind of issue to investigate and perhaps fix. Always hard to do over e-mail, but let's see what we can do.


The first question is to understand the full query: are you doing more than a simple scan of two files and a UNION ALL? Are there sorts or joins involved?

The best place to start to investigate performance issues is the query profile, which it looks like you are doing. What is the time for the scans if you run each of the two scans separately? You said that they take 8 and 1 seconds. Is that for the whole query or just the scan operators?

Then, when you run the UNION ALL, again looking at the scan operators, is there any difference in run times? If the scans take longer, that is one thing to investigate. If the scans take the same amount of time, what other operator(s) are taking the rest of the time? Your note suggests that it is the scan taking the time. But, there should be two scan operators: one for each file. How is the time divided between them?


How large are the data files? Using what storage system? How many Drillbits? How much memory?


Thanks,
- Paul

 

    On Wednesday, April 22, 2020, 11:32:24 AM PDT, sreeparna bhabani <bh...@gmail.com> wrote:  
 
 Hi Team,

I reach out to you for a specific problem regarding UNION ALL. There is one
UNION ALL statement which combines 2 queries. The individual queries are
taking 8 secs and 1 sec respectively. But UNION ALL takes 30 secs.
PARQUET_SCAN_ROW_GROUP takes the maximum time. Apache drill version is 1.17.

Please help to suggest how to improve this UNION ALL performance. We are
using parquet file.

Thanks,
Sreeparna Bhabani
  


  

Re: Suggestion needed for UNION ALL performance in Apache drill

Posted by sreeparna bhabani <bh...@gmail.com>.
Hi Team,

In addition to the below mail I have another finding. Please consider below
scenarios. The first 2 scenarios are giving expected results in terms of
performance. But we are not getting expected performance for 3rd scenario
which is UNION ALL with 2 different types of datasets.

*Scenario 1- Parquet UNION ALL Parquet*
Individual execution time of 1st query - 5 secs
Individual execution time of 2nd query - 5 secs
UNION ALL of both queries execution time - 10 secs

*Scenario 2 - DB query UNION ALL DB* *query*
Individual execution time of 1st query - 5 secs
Individual execution time of 2nd query - 5 secs
UNION ALL of both queries execution time - 10 secs

*Scenario 3 - Parquet UNION ALL DB query*
Individual execution time of 1st query - 5 secs
Individual execution time of 2nd query - 1 sec
UNION ALL execution time - 20 secs
Ideally the execution time should not be more than 6 secs.

May I request you to check whether the UNION ALL performance of 3rd
scenario is expected with different dataset types.

Please suggest if there is any specific way to bring down the execution
time of 3rd scenario.

Thanks in advance.

Sreeparna Bhabani



On Thu, 23 Apr 2020, 12:18 sreeparna bhabani, <bh...@gmail.com>
wrote:

> Hi Team,
>
> Apart from the below issue I have another question.
>
> Is there any relation between number of row groups and performance ?
>
> In the below query the number of files is 13 and numRowGroups is 69. Is
> the UNION ALL takes more time if the number of rowgroup is high like that.
>
> Please note that the individual Parquet query takes 6 secs. But UNION ALL
> takes 20 secs. Details are given in trail mail.
>
> Thanks,
> Sreeparna Bhabani
>
> On Thu, 23 Apr 2020, 11:08 sreeparna bhabani, <di...@gmail.com>
> wrote:
>
>> Hi Paul,
>>
>> Please find the details below. We are using 2 drillbits. Heap memory 16
>> G, Max direct memory 32 G. One query selects from Parquet. Another one
>> selects fron JDBC. The parquet file size is 849 MB. It is UNION ALL. There
>> is not sorting.
>>
>> Single parquet query-
>> Total execution time - 6.6 sec
>> Scan time - 0.152 sec
>> Screen wait time - 5.3 sec
>>
>> Single JDBC query-
>> Total execution time - 0.261 sec
>> JDBC scan - 0.152 sec
>> Screen wait - 0.004 sec
>>
>>
>> Union all query -
>> Execution time - 21. 118 sec
>> Screen wait time - 5.351 sec
>> Parquet scan - 15.368 sec
>> Unordered receiver wait time - 14.41 sec
>>
>> Thanks,
>> Sreeparna Bhabani
>>
>>
>> On Thu, 23 Apr 2020, 10:43 Paul Rogers, <pa...@yahoo.com> wrote:
>>
>>> Hi Sreeparna,
>>>
>>>
>>> The short answer is it *should* work: a UNION ALL is simply an append.
>>> (Be sure you are not using a plain UNION as that needs to do more work to
>>> remove duplicates.)
>>>
>>>
>>> Since you are seeing unexpected behavior, we may have some kind of issue
>>> to investigate and perhaps fix. Always hard to do over e-mail, but let's
>>> see what we can do.
>>>
>>>
>>> The first question is to understand the full query: are you doing more
>>> than a simple scan of two files and a UNION ALL? Are there sorts or joins
>>> involved?
>>>
>>>
>>> The best place to start to investigate performance issues is the query
>>> profile, which it looks like you are doing. What is the time for the scans
>>> if you run each of the two scans separately? You said that they take 8 and
>>> 1 seconds. Is that for the whole query or just the scan operators?
>>>
>>>
>>> Then, when you run the UNION ALL, again looking at the scan operators,
>>> is there any difference in run times? If the scans take longer, that is one
>>> thing to investigate. If the scans take the same amount of time, what other
>>> operator(s) are taking the rest of the time? Your note suggests that it is
>>> the scan taking the time. But, there should be two scan operators: one for
>>> each file. How is the time divided between them?
>>>
>>>
>>> How large are the data files? Using what storage system? How many
>>> Drillbits? How much memory?
>>>
>>>
>>> Thanks,
>>>
>>> - Paul
>>>
>>>
>>>
>>> On Wednesday, April 22, 2020, 11:32:24 AM PDT, sreeparna bhabani <
>>> bhabani.sreeparna@gmail.com> wrote:
>>>
>>>
>>> Hi Team,
>>>
>>> I reach out to you for a specific problem regarding UNION ALL. There is
>>> one
>>> UNION ALL statement which combines 2 queries. The individual queries are
>>> taking 8 secs and 1 sec respectively. But UNION ALL takes 30 secs.
>>> PARQUET_SCAN_ROW_GROUP takes the maximum time. Apache drill version is
>>> 1.17.
>>>
>>> Please help to suggest how to improve this UNION ALL performance. We are
>>> using parquet file.
>>>
>>> Thanks,
>>> Sreeparna Bhabani
>>>
>>

Re: Suggestion needed for UNION ALL performance in Apache drill

Posted by sreeparna bhabani <bh...@gmail.com>.
Hi Team,

Apart from the below issue I have another question.

Is there any relation between number of row groups and performance ?

In the below query the number of files is 13 and numRowGroups is 69. Is the
UNION ALL takes more time if the number of rowgroup is high like that.

Please note that the individual Parquet query takes 6 secs. But UNION ALL
takes 20 secs. Details are given in trail mail.

Thanks,
Sreeparna Bhabani

On Thu, 23 Apr 2020, 11:08 sreeparna bhabani, <di...@gmail.com>
wrote:

> Hi Paul,
>
> Please find the details below. We are using 2 drillbits. Heap memory 16 G,
> Max direct memory 32 G. One query selects from Parquet. Another one selects
> fron JDBC. The parquet file size is 849 MB. It is UNION ALL. There is not
> sorting.
>
> Single parquet query-
> Total execution time - 6.6 sec
> Scan time - 0.152 sec
> Screen wait time - 5.3 sec
>
> Single JDBC query-
> Total execution time - 0.261 sec
> JDBC scan - 0.152 sec
> Screen wait - 0.004 sec
>
>
> Union all query -
> Execution time - 21. 118 sec
> Screen wait time - 5.351 sec
> Parquet scan - 15.368 sec
> Unordered receiver wait time - 14.41 sec
>
> Thanks,
> Sreeparna Bhabani
>
>
> On Thu, 23 Apr 2020, 10:43 Paul Rogers, <pa...@yahoo.com> wrote:
>
>> Hi Sreeparna,
>>
>>
>> The short answer is it *should* work: a UNION ALL is simply an append.
>> (Be sure you are not using a plain UNION as that needs to do more work to
>> remove duplicates.)
>>
>>
>> Since you are seeing unexpected behavior, we may have some kind of issue
>> to investigate and perhaps fix. Always hard to do over e-mail, but let's
>> see what we can do.
>>
>>
>> The first question is to understand the full query: are you doing more
>> than a simple scan of two files and a UNION ALL? Are there sorts or joins
>> involved?
>>
>>
>> The best place to start to investigate performance issues is the query
>> profile, which it looks like you are doing. What is the time for the scans
>> if you run each of the two scans separately? You said that they take 8 and
>> 1 seconds. Is that for the whole query or just the scan operators?
>>
>>
>> Then, when you run the UNION ALL, again looking at the scan operators, is
>> there any difference in run times? If the scans take longer, that is one
>> thing to investigate. If the scans take the same amount of time, what other
>> operator(s) are taking the rest of the time? Your note suggests that it is
>> the scan taking the time. But, there should be two scan operators: one for
>> each file. How is the time divided between them?
>>
>>
>> How large are the data files? Using what storage system? How many
>> Drillbits? How much memory?
>>
>>
>> Thanks,
>>
>> - Paul
>>
>>
>>
>> On Wednesday, April 22, 2020, 11:32:24 AM PDT, sreeparna bhabani <
>> bhabani.sreeparna@gmail.com> wrote:
>>
>>
>> Hi Team,
>>
>> I reach out to you for a specific problem regarding UNION ALL. There is
>> one
>> UNION ALL statement which combines 2 queries. The individual queries are
>> taking 8 secs and 1 sec respectively. But UNION ALL takes 30 secs.
>> PARQUET_SCAN_ROW_GROUP takes the maximum time. Apache drill version is
>> 1.17.
>>
>> Please help to suggest how to improve this UNION ALL performance. We are
>> using parquet file.
>>
>> Thanks,
>> Sreeparna Bhabani
>>
>

Re: Suggestion needed for UNION ALL performance in Apache drill

Posted by Paul Rogers <pa...@yahoo.com.INVALID>.
Hi Sreeparna,

The short answer is it *should* work: a UNION ALL is simply an append. (Be sure you are not using a plain UNION as that needs to do more work to remove duplicates.)

Since you are seeing unexpected behavior, we may have some kind of issue to investigate and perhaps fix. Always hard to do over e-mail, but let's see what we can do.


The first question is to understand the full query: are you doing more than a simple scan of two files and a UNION ALL? Are there sorts or joins involved?

The best place to start to investigate performance issues is the query profile, which it looks like you are doing. What is the time for the scans if you run each of the two scans separately? You said that they take 8 and 1 seconds. Is that for the whole query or just the scan operators?

Then, when you run the UNION ALL, again looking at the scan operators, is there any difference in run times? If the scans take longer, that is one thing to investigate. If the scans take the same amount of time, what other operator(s) are taking the rest of the time? Your note suggests that it is the scan taking the time. But, there should be two scan operators: one for each file. How is the time divided between them?


How large are the data files? Using what storage system? How many Drillbits? How much memory?


Thanks,
- Paul

 

    On Wednesday, April 22, 2020, 11:32:24 AM PDT, sreeparna bhabani <bh...@gmail.com> wrote:  
 
 Hi Team,

I reach out to you for a specific problem regarding UNION ALL. There is one
UNION ALL statement which combines 2 queries. The individual queries are
taking 8 secs and 1 sec respectively. But UNION ALL takes 30 secs.
PARQUET_SCAN_ROW_GROUP takes the maximum time. Apache drill version is 1.17.

Please help to suggest how to improve this UNION ALL performance. We are
using parquet file.

Thanks,
Sreeparna Bhabani