You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@impala.apache.org by Dinushka <di...@yahoo.com> on 2019/10/31 01:09:49 UTC

Impala slow at order by

Hi,
I'm a student doing an internship, I have been given a task to do DB performance testing for kudu with Impala for our data and use case.
Sample dataset is about 150 million records with 150 columns  and total size of kudu is 55GB. composite primary key (X,Y,Z) and partitioning by hash (X =4,Y=2,Z=2)
SQL 1= "select  A from table where  G="value""SQL 2= "select  A from table where  G="value" order by Z"

I'm testing kudu and Impala in standalone mode and have 2 queries which will only return one row. One with "order by" and other without "order by".
When I do testing, I found that Impala with order by is about 15% to 35% slow. when you have order by in the SQL.
In large row counts queries, it's time can be about 2-20 times more.
1) Why is Impala slow with order by?
2) Can order by  be made faster in clustered mode, that mean made to be parallelized ? 
3) Is it a good idea to use order by with Impala? if so have any body use it with a larger data set with good performance.
4) Is there any other solutions to do fast order by queries within few seconds. (Interactive query engines) 
Thank you

Re: Impala slow at order by

Posted by Csaba Ringhofer <cs...@cloudera.com>.

> SQL 1= "select  A from table where  G="value""
> SQL 2= "select  A from table where  G="value" order by Z"
Note that SQL 2 does not only add an ORDER BY, but also another column to
read, Z, so 3 columns have to be read instead of 2.
This leads to extra work for both Impala and Kudu (it is a column oriented
storage engine, so has to read more data if there are more columns).

A fairer way to test the effect of order by would be to always add Z to the
query, e.g.
select  A, Z from table where  G="value"
select  A, Z from table where  G="value" order by Z


On Thu, Oct 31, 2019 at 3:51 AM Tim Armstrong <ta...@cloudera.com>
wrote:

> > 1) Why is Impala slow with order by?
> Impala's sort implementation is generally very fast, but the bottleneck
> for a query can be in many different places.
>
> Like Shant said, the query profile will contain all the info needed to
> determine where time was spent in the query. We don't have enough
> information now to suggest why your query is performing the way it is. I'm
> surprised that a sort operation on a single row would make that much
> difference in query runtime.
>
> > 2) Can order by  be made faster in clustered mode, that mean made to
> be parallelized ?
> Yes, sorts are parallelised across the cluster, *except* for the final
> merge stage where the rows are collected and streamed back to the client.
>
> If you want to play around with increasing the parallelism of the sort,
> there is a query option mt_dop that switches to a different multithreading
> mode where you can increase the number of threads executing the plan,
> including sort. E.g. if you set mt_dop=4 then on each node you will get 4
> threads doing the sort (assuming there are enough input files that the work
> can be partitioned).
>
> > 3) Is it a good idea to use order by with Impala? if so have any body
> use it with a larger data set with good performance.
> If you're running a SELECT with an ORDER BY that returns a lot of rows,
> the bottleneck is most likely the output - Impala can probably sort the
> rows much faster than the client can fetch them.
>
> This actually makes it kinda hard to benchmark sort performance. We have
> some targeted benchmarks that use analytic functions to test sort
> performance to avoid the problem (it works because the query plans for
> analytic functions involve a sort). See
> https://github.com/apache/impala/blob/master/testdata/workloads/targeted-perf/queries/primitive_orderby_bigint.test
>
> > 4) Is there any other solutions to do fast order by queries within few
> seconds. (Interactive query engines)
> This is exactly what Impala is good at.
>
> On Wed, Oct 30, 2019 at 7:07 PM Shant Hovsepian <sh...@arcadiadata.com>
> wrote:
>
>> Hi Dinushka!
>>
>> That's awesome you're working with Apache Impala for your internship! You
>> should know the Impala website provides some incredible documentation. In
>> particular check out the performance section to learn about how to
>> benchmark queries.
>> https://impala.apache.org/docs/build/html/topics/impala_performance.html
>>
>> As you mentioned since it's in "standalone mode" all sorts of things can
>> be going, however the easiest way to get some insights to do look at the
>> explain plan and profile for the queries you run. For examples take a look
>> at this page
>> https://impala.apache.org/docs/build/html/topics/impala_explain_plan.html
>>
>> Any chance you can share the EXPLAIN and PROFILE statement output of your
>> queries? You should also take a look at the output of the SUMMARY statement
>> for yourself in case anything obvious stands out to you.
>>
>> Thanks!
>>
>> On Wed, Oct 30, 2019 at 9:10 PM Dinushka <di...@yahoo.com> wrote:
>>
>>> Hi,
>>>
>>> I'm a student doing an internship, I have been given a task to do DB
>>> performance testing for kudu with Impala for our data and use case.
>>>
>>> Sample dataset is about 150 million records with 150 columns  and total
>>> size of kudu is 55GB. composite primary key (X,Y,Z) and partitioning by
>>> hash (X =4,Y=2,Z=2)
>>>
>>> SQL 1= "select  A from table where  G="value""
>>> SQL 2= "select  A from table where  G="value" order by Z"
>>>
>>> I'm testing kudu and Impala in standalone mode and have 2 queries which
>>> will only return one row. One with "order by" and other without "order
>>> by".
>>>
>>> When I do testing, I found that Impala with order by is about 15% to
>>> 35% slow. when you have order by in the SQL.
>>>
>>> In large row counts queries, it's time can be about 2-20 times more.
>>>
>>> 1) Why is Impala slow with order by?
>>>
>>> 2) Can order by  be made faster in clustered mode, that mean made to be parallelized
>>> ?
>>>
>>> 3) Is it a good idea to use order by with Impala? if so have any body
>>> use it with a larger data set with good performance.
>>>
>>> 4) Is there any other solutions to do fast order by queries within few
>>> seconds. (Interactive query engines)
>>>
>>>
>>> Thank you
>>>
>>

Re: Impala slow at order by

Posted by Tim Armstrong <ta...@cloudera.com>.

> 1) Why is Impala slow with order by?
Impala's sort implementation is generally very fast, but the bottleneck for
a query can be in many different places.

Like Shant said, the query profile will contain all the info needed to
determine where time was spent in the query. We don't have enough
information now to suggest why your query is performing the way it is. I'm
surprised that a sort operation on a single row would make that much
difference in query runtime.

> 2) Can order by  be made faster in clustered mode, that mean made to
be parallelized ?
Yes, sorts are parallelised across the cluster, *except* for the final
merge stage where the rows are collected and streamed back to the client.

If you want to play around with increasing the parallelism of the sort,
there is a query option mt_dop that switches to a different multithreading
mode where you can increase the number of threads executing the plan,
including sort. E.g. if you set mt_dop=4 then on each node you will get 4
threads doing the sort (assuming there are enough input files that the work
can be partitioned).

> 3) Is it a good idea to use order by with Impala? if so have any body use
it with a larger data set with good performance.
If you're running a SELECT with an ORDER BY that returns a lot of rows, the
bottleneck is most likely the output - Impala can probably sort the rows
much faster than the client can fetch them.

This actually makes it kinda hard to benchmark sort performance. We have
some targeted benchmarks that use analytic functions to test sort
performance to avoid the problem (it works because the query plans for
analytic functions involve a sort). See
https://github.com/apache/impala/blob/master/testdata/workloads/targeted-perf/queries/primitive_orderby_bigint.test

> 4) Is there any other solutions to do fast order by queries within few
seconds. (Interactive query engines)
This is exactly what Impala is good at.

On Wed, Oct 30, 2019 at 7:07 PM Shant Hovsepian <sh...@arcadiadata.com>
wrote:

> Hi Dinushka!
>
> That's awesome you're working with Apache Impala for your internship! You
> should know the Impala website provides some incredible documentation. In
> particular check out the performance section to learn about how to
> benchmark queries.
> https://impala.apache.org/docs/build/html/topics/impala_performance.html
>
> As you mentioned since it's in "standalone mode" all sorts of things can
> be going, however the easiest way to get some insights to do look at the
> explain plan and profile for the queries you run. For examples take a look
> at this page
> https://impala.apache.org/docs/build/html/topics/impala_explain_plan.html
>
> Any chance you can share the EXPLAIN and PROFILE statement output of your
> queries? You should also take a look at the output of the SUMMARY statement
> for yourself in case anything obvious stands out to you.
>
> Thanks!
>
> On Wed, Oct 30, 2019 at 9:10 PM Dinushka <di...@yahoo.com> wrote:
>
>> Hi,
>>
>> I'm a student doing an internship, I have been given a task to do DB
>> performance testing for kudu with Impala for our data and use case.
>>
>> Sample dataset is about 150 million records with 150 columns  and total
>> size of kudu is 55GB. composite primary key (X,Y,Z) and partitioning by
>> hash (X =4,Y=2,Z=2)
>>
>> SQL 1= "select  A from table where  G="value""
>> SQL 2= "select  A from table where  G="value" order by Z"
>>
>> I'm testing kudu and Impala in standalone mode and have 2 queries which
>> will only return one row. One with "order by" and other without "order by
>> ".
>>
>> When I do testing, I found that Impala with order by is about 15% to 35%
>> slow. when you have order by in the SQL.
>>
>> In large row counts queries, it's time can be about 2-20 times more.
>>
>> 1) Why is Impala slow with order by?
>>
>> 2) Can order by  be made faster in clustered mode, that mean made to be parallelized
>> ?
>>
>> 3) Is it a good idea to use order by with Impala? if so have any body use
>> it with a larger data set with good performance.
>>
>> 4) Is there any other solutions to do fast order by queries within few
>> seconds. (Interactive query engines)
>>
>>
>> Thank you
>>
>

Re: Impala slow at order by

Posted by Shant Hovsepian <sh...@arcadiadata.com>.

Hi Dinushka!

That's awesome you're working with Apache Impala for your internship! You
should know the Impala website provides some incredible documentation. In
particular check out the performance section to learn about how to
benchmark queries.
https://impala.apache.org/docs/build/html/topics/impala_performance.html

As you mentioned since it's in "standalone mode" all sorts of things can be
going, however the easiest way to get some insights to do look at the
explain plan and profile for the queries you run. For examples take a look
at this page
https://impala.apache.org/docs/build/html/topics/impala_explain_plan.html

Any chance you can share the EXPLAIN and PROFILE statement output of your
queries? You should also take a look at the output of the SUMMARY statement
for yourself in case anything obvious stands out to you.

Thanks!

On Wed, Oct 30, 2019 at 9:10 PM Dinushka <di...@yahoo.com> wrote:

> Hi,
>
> I'm a student doing an internship, I have been given a task to do DB
> performance testing for kudu with Impala for our data and use case.
>
> Sample dataset is about 150 million records with 150 columns  and total
> size of kudu is 55GB. composite primary key (X,Y,Z) and partitioning by
> hash (X =4,Y=2,Z=2)
>
> SQL 1= "select  A from table where  G="value""
> SQL 2= "select  A from table where  G="value" order by Z"
>
> I'm testing kudu and Impala in standalone mode and have 2 queries which
> will only return one row. One with "order by" and other without "order by"
> .
>
> When I do testing, I found that Impala with order by is about 15% to 35%
> slow. when you have order by in the SQL.
>
> In large row counts queries, it's time can be about 2-20 times more.
>
> 1) Why is Impala slow with order by?
>
> 2) Can order by  be made faster in clustered mode, that mean made to be parallelized
> ?
>
> 3) Is it a good idea to use order by with Impala? if so have any body use
> it with a larger data set with good performance.
>
> 4) Is there any other solutions to do fast order by queries within few
> seconds. (Interactive query engines)
>
>
> Thank you
>