You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues-all@impala.apache.org by "Sahil Takiar (JIRA)" <ji...@apache.org> on 2018/10/17 20:30:00 UTC
[jira] [Commented] (IMPALA-5004) Switch to sorting node for large TopN queries

    [ https://issues.apache.org/jira/browse/IMPALA-5004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16654154#comment-16654154 ] 

Sahil Takiar commented on IMPALA-5004:
--------------------------------------

Did a bit of benchmarking to find a good default threshold and it looks like 96MB is a good value. Here is how I came to that value:

First, I generated some dummy data, essentially a table with 50,000,000 rows with a bunch of random integers.
{code:java}
use tpcds_parquet;
create table tmp stored as parquet as select cast(rand() * 1000000000 as bigint) as col1, cast(rand() * 1000000000 as bigint) as col2
from store_returns, store_sales where store_returns.sr_item_sk = store_sales.ss_item_sk limit 50000000;
{code}
Then I ran the following query with various values for the offset:
{code:java}
 select * from tmp order by col1, col2 limit 100 offset ? {code}
I measured the latency for the query with varying offsets when either a topn operator was used or a sort operator.
||Offset||TopN Latency(s)||Sort Latency (s)||
|1000000|8|34|
|2000000|14|34|
|3000000|20|35|
|4000000|25|35|
|5000000|30|35|
|6000000|36|35|
|7000000|39|36|
|8000000|43|36|
|9000000|46|37|
|10000000|50|36|

I confirmed that there was no memory starvation for any of the queries, and that the Sort operator never spilled to disk.

So it seems that after some point, the sort operator is faster than the TopN operator. I'm not entirely sure why, perhaps Tim's comment above is true. It's also a bit weird that the sort latency almost entirely stays the same, whereas TopN latency scales linearly with the offset value.

Regardless, setting {{topn_bytes_limit}} to a value of 96 mb causes the planner to switch from TopN to Sort somewhere in the 5,000,000 to 6,000,000 range, which looks like the sweet spot where sort starts to perform faster than topn.

> Switch to sorting node for large TopN queries
> ---------------------------------------------
>
>                 Key: IMPALA-5004
>                 URL: https://issues.apache.org/jira/browse/IMPALA-5004
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Frontend
>    Affects Versions: Impala 2.9.0
>            Reporter: Lars Volker
>            Assignee: Sahil Takiar
>            Priority: Major
>
> As explained by [~tarmstrong] in IMPALA-4995:
> bq. We should also consider switching to the sort operator for large limits. This allows it to spill. The memory requirements for TopN also are problematic for large limits, since it would allocate large vectors that are untracked and also require a large amount of contiguous memory.
> There's already logic to select TopN vs. Sort: [planner/SingleNodePlanner.java#L289|https://github.com/apache/incubator-impala/blob/master/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java#L289]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-all-unsubscribe@impala.apache.org
For additional commands, e-mail: issues-all-help@impala.apache.org