You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@impala.apache.org by Vibhath Ileperuma <vi...@gmail.com> on 2022/01/04 13:24:35 UTC

Clarification on query execution

Hi All,

We are having a distributed impala cluster with 6 impala executors and
there is a kudu table with 3x replication. This table is hash partitioned
by 3 columns (say HC1, HC2, HC3) and there are 2 partitions available from
each column. Also this table is range partitioned by another column (RC1).
Ex:
[image: image.png]

We are executing the following sql on this table.
[image: image.png]

When checking the query profile of this sql from impala web UI, we noticed
that only two executors are used to execute this sql.

   1. Why impala doesn't use all 6 executors for execution? Since we have
   given conditions for HC1 and HC3 in where clause, does impala use only two
   executors to scan two partitions of HC2? Is there a way to distribute the
   sql to all 6 executors?
   2. When executing this type of multiple sqls in parallel, we noticed all
   the queries use same two executors causing high memory usage. To avoid this
   I set the 'replica_preference' query option to 'remote' and then the
   queries dispatched to different two executors. However it introduced a
   small delay in execution time. Is there a way to avoid this high memory
   usage without introducing a delay to execution time.

Thank You.

Best Regards,

Vibhath.

Re: Clarification on query execution

Posted by Tim Armstrong <ti...@gmail.com>.
SCHEDULE_RANDOM_REPLICA may be more what you're looking for than
REPLICA_PREFERENCE. It's available as both a query option and  as a SQL
hint -
https://impala.apache.org/docs/build/html/topics/impala_schedule_random_replica.html
and https://impala.apache.org/docs/build/html/topics/impala_hints.html.
SCHEDULE_RANDOM_REPLICA keeps the processing local to the tablets, but
assigns work randomly across all replicas of the tablet (usually 3).

The original behaviour for Kudu tables is not to subdivide tables between
executors/threads. We did some early work to support splitting up Kudu
tablets into smaller parts but I believe it's only an undocumented option
in recent Impala versions (TARGETED_KUDU_SCAN_RANGE_LENGTH) -
https://issues.apache.org/jira/browse/IMPALA-9792
<https://issues.apache.org/jira/plugins/servlet/mobile#issue/IMPALA-9792>.
Maybe there have been developments since I looked at this last.

On Tue, 4 Jan 2022, 05:24 Vibhath Ileperuma, <vi...@gmail.com>
wrote:

> Hi All,
>
> We are having a distributed impala cluster with 6 impala executors and
> there is a kudu table with 3x replication. This table is hash partitioned
> by 3 columns (say HC1, HC2, HC3) and there are 2 partitions available from
> each column. Also this table is range partitioned by another column (RC1).
> Ex:
> [image: image.png]
>
> We are executing the following sql on this table.
> [image: image.png]
>
> When checking the query profile of this sql from impala web UI, we noticed
> that only two executors are used to execute this sql.
>
>    1. Why impala doesn't use all 6 executors for execution? Since we have
>    given conditions for HC1 and HC3 in where clause, does impala use only two
>    executors to scan two partitions of HC2? Is there a way to distribute the
>    sql to all 6 executors?
>    2. When executing this type of multiple sqls in parallel, we noticed
>    all the queries use same two executors causing high memory usage. To avoid
>    this I set the 'replica_preference' query option to 'remote' and then the
>    queries dispatched to different two executors. However it introduced a
>    small delay in execution time. Is there a way to avoid this high memory
>    usage without introducing a delay to execution time.
>
> Thank You.
>
> Best Regards,
>
> Vibhath.
>