Posted to user@impala.apache.org by sameera <sa...@gmail.com> on 2022/07/18 18:26:07 UTC

Impala sorting, memory utilization and scratch data usage

Hi Team,

I'm seeing the following behaviour regarding memory utilization in Impala executors.

I have a table with 298GB of parquet data in S3 and I'm trying to scan all
partitions and do a Top-N operation to understand the memory usage in
executors.

When I set the executor startup option mem_limit to 1GB, the query takes 600s.
With a 10GB mem_limit it completes within 200s while utilizing memory up to
10GB. This could be because Impala decided to start only one scanner thread
(NumScannerThreadsStarted = 1) under the 1GB mem_limit, versus 16 scanner
threads (NumScannerThreadsStarted = 16) under the 10GB mem_limit.

During both tests I didn't see any intermediate files created in the scratch
dirs. My question is: how does Impala manage to sort the entire table when
memory is limited to 1GB? Please help me understand how it works internally.
Any design document would be really helpful.

select * from execution_report_s3_lse order by time_sequence limit 1;

Impala version 3.4
6 executor node cluster with dedicated coordinators.
Node spec - 16 Core, 32 GB memory

Thank you,
Sameera.

Re: Impala sorting, memory utilization and scratch data usage

Posted by Tim Armstrong <ti...@gmail.com>.
Gabor is right on the last point - the Top-N node is used when there is an
ORDER BY with a LIMIT <n> clause in the same SELECT block. Each thread
executing the Top-N node only needs to keep n rows from the input during
processing, regardless of the input size. This is very effective and
explains the results you're seeing.
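The bounded-memory idea behind Top-N can be sketched in a few lines. This is a minimal Python illustration of the general technique, not Impala's actual C++ implementation, and it assumes a numeric sort key so the key can be negated for the max-heap:

```python
import heapq

def top_n(rows, n, key):
    """Streaming ORDER BY key LIMIT n with O(n) memory.

    Keeps a max-heap of the n best rows seen so far; a new row can only
    replace the current worst kept row, so memory stays bounded by n no
    matter how many input rows are scanned.
    """
    heap = []  # entries: (negated key, arrival index, row); heap[0] = worst kept
    for i, row in enumerate(rows):
        entry = (-key(row), i, row)
        if len(heap) < n:
            heapq.heappush(heap, entry)
        elif entry > heap[0]:            # smaller key than the worst kept row
            heapq.heapreplace(heap, entry)
    # undo the negation: sort descending on negated keys = ascending on keys
    return [row for _, _, row in sorted(heap, reverse=True)]
```

For example, top_n(range(1000), 3, key=lambda x: x) scans 1000 rows but never holds more than 3 in the heap, which is why the query above finishes without touching the scratch dirs.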

I don't think a design doc was ever written for the Top-N operator. The
comments in the implementation explain how it works at a high level and are
probably the best place to start:
https://github.com/apache/impala/blob/master/be/src/exec/topn-node.h#L120.

"External Merge Sort for Top-K queries" by Chronis et al -
https://dl.acm.org/doi/pdf/10.1145/3318464.3389729 - has a good overview of
techniques for this problem in sections 1 and 2. Impala does 2.3 and falls
back to 2.4 for larger limits, at least last time I looked. I think Gabor
is suggesting something like 2.5 which is a nice optimisation that Impala
doesn't have afaik.


Re: Impala sorting, memory utilization and scratch data usage

Posted by Csaba Ringhofer <cs...@cloudera.com>.
Hi!

As Gabor wrote, due to the LIMIT clause a Top-N node is created, which
needs only a bounded amount of memory.
This can be seen in the plan in the profile: 01:TOP-N [LIMIT=2]

You can disable this optimisation with the query option set
disable_outermost_topn=true; running the same query with that option
will lead to fully sorting the input data.




Re: Impala sorting, memory utilization and scratch data usage

Posted by Gabor Kaszab <ga...@cloudera.com>.
Hi Sameera,

If a sorter (or a Top-N node in your case) doesn't fit into memory then it
does a partial sort of the data that fits into memory (called a "run" if
I'm not mistaken) and writes it to disk. After this it loads another
subset of the data into memory, does a partial sort again, and spills to
disk. It does this until the node runs out of rows to be processed. Once
this happens, a merging sort is applied to the partially sorted "runs".
Specifically, in your query each of the 6 executor nodes does these partial
sorts and then the merging sort to get the 2 rows as a result, due to the
LIMIT clause, per executor node. A finishing step is a MERGING-EXCHANGE that
receives 2 rows from each of the 6 executors and produces the final result
of the "ORDER BY LIMIT 2".
As a result, the more memory you allocate to Impala, the bigger these
partially sorted "runs" can be, and the easier the merging sort becomes.
Additionally, I haven't checked, but I assume that if you have a LIMIT
clause Impala doesn't have to spill the whole run to disk, just the part
that survives the LIMIT clause (2 rows per run in this case), so the fewer
runs you have, the less disk IO you have to deal with.
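The run-and-merge scheme described above can be sketched as follows. This is plain Python with in-memory lists standing in for scratch files; it illustrates the general external-sort technique, not Impala's actual sorter:

```python
import heapq

def external_sort(rows, key, max_rows_in_memory):
    """Sort more data than fits in memory: sorted "runs" plus a final merge.

    Each run is sorted within the memory budget and "spilled" (here just
    kept in a list; a real sorter writes it to a scratch file on disk).
    The final pass is a streaming k-way merging sort over all runs.
    """
    runs, run = [], []
    for row in rows:
        run.append(row)
        if len(run) == max_rows_in_memory:   # memory budget reached: spill a run
            run.sort(key=key)
            runs.append(run)
            run = []
    if run:                                  # last, possibly partial, run
        run.sort(key=key)
        runs.append(run)
    # the k-way merge reads each sorted run one row at a time
    return list(heapq.merge(*runs, key=key))
```

A bigger memory budget means fewer, longer runs and therefore a cheaper merge step, which matches the mem_limit effect above.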

Update: I gave this a second thought and I assume that a Top-N node (when
you have a LIMIT clause) doesn't even have to spill to disk at all, as in
your case it only has to maintain the top 2 values due to the LIMIT
clause.

Hope this helps.
