Posted to user@impala.apache.org by Fawze Abujaber <fa...@gmail.com> on 2018/05/19 13:47:51 UTC

Aggregate Peak Memory Usage and Per Node Peak Memory Usage

Hi Community,

I'm investigating the peak times when the Impala daemons' memory was
consumed, so I can distribute my queries in the right way.

While looking into one scenario, I see one query with the stats below:

*** The query has several joins, which is why it takes so much time.

Duration: 4.1m
Rows Produced: 30102
Aggregate Peak Memory Usage: 6.5 GiB
Per Node Peak Memory Usage: 6.5 GiB
Total number of bytes read from HDFS: 727 MiB
Memory Accrual: 86228522168 byte seconds
Pool: default-pool
Query State: FINISHED
Threads: CPU Time: 8m

====================

I'd like to understand why Aggregate Peak Memory Usage and Per Node Peak
Memory Usage are identical, when the query profile shows that it ran
several fragments on different nodes.

I also see that every time this query ran (it is a daily scheduled query),
these two parameters had identical values.

What I'm trying to understand:

1) If the query has several fragments, shouldn't the two parameters be
different?

2) Can this scenario happen because the number of bytes read from HDFS is
small, so all the data may have been read from a single node?

3) Since the node with the peak memory is the coordinator, and the two
parameters are identical, does that mean the coordinator also executed
most of the query?

4) Thinking about when these two metrics can have the same value: at some
point there was only one node participating in the query's coordination
and execution, which would certainly be the coordinator node, and at that
point the query reached its highest aggregate memory consumption.

Is my assumption true?

5) If my previous assumption is true, is there any way to keep the
coordinator from participating in query execution? (Having 2-3 queries
running at the same time in such a scenario will fail.)

6) While looking at the fragments, I see the following on one of them:

PeakMemoryUsage: 141.1 MiB
PerHostPeakMemUsage: 6.5 GiB

Does PeakMemoryUsage refer to the node that ran the specific fragment,
while PerHostPeakMemUsage refers to the node with the peak memory across
the cluster for this specific query?


-- 
Take Care
Fawze Abujaber

Re: Aggregate Peak Memory Usage and Per Node Peak Memory Usage

Posted by Fawze Abujaber <fa...@gmail.com>.
Hi Mostafa,

Thanks for your response.

We are using Spark to create HDFS Parquet files, and we defined external
tables on top of these Parquet files.

Since we are running hundreds of scheduled queries on these tables, I'm not
sure that running compute stats can change the scenario here.

Maybe I'd like to separate these issues: first understand what happened and
where the numbers I see come from, and then maybe try to enhance the
analyzer.

Lastly, I need to find a way to run compute stats on the tables.
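
For illustration only, a minimal sketch of scripting that across many
tables, assuming the Python impyla client and a hand-maintained table list
(the host, port, and table names are placeholders, not from this thread):

    # Sketch: loop COMPUTE STATS over a list of tables with impyla (pip install impyla).
    # Host, port, and the table list are illustrative placeholders.
    from impala.dbapi import connect

    TABLES = ["default.some_fact_table", "default.some_dim_table"]  # hypothetical names

    conn = connect(host="impala-coordinator.example.com", port=21050)
    cur = conn.cursor()
    for table in TABLES:
        try:
            cur.execute("COMPUTE STATS {0}".format(table))
            print("computed stats for", table)
        except Exception as exc:  # e.g. tables that fail with Parquet schema errors
            print("skipping", table, "->", exc)
    cur.close()
    conn.close()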




-- 
Take Care
Fawze Abujaber

Re: Aggregate Peak Memory Usage and Per Node Peak Memory Usage

Posted by Mostafa Mokhtar <mm...@cloudera.com>.
If you are using Spark to load data to be queried by Impala, please
use spark.sql.parquet.writeLegacyFormat.
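
For illustration, a minimal PySpark sketch of writing the Parquet files
with that option enabled (the session setup, output path, and DataFrame
contents are placeholders, not from this thread):

    # Sketch: write Impala-compatible Parquet from Spark with the legacy format flag.
    # The output path and the DataFrame contents are illustrative placeholders.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("write-impala-parquet")
             .config("spark.sql.parquet.writeLegacyFormat", "true")
             .getOrCreate())

    df = spark.range(1000).withColumnRenamed("id", "event_id")  # placeholder data
    df.write.mode("overwrite").parquet("hdfs:///path/to/external/table/")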

Please set the number of rows on the tables and the distinct value counts
on the columns, as described in
https://www.cloudera.com/documentation/enterprise/5-8-x/topics/impala_perf_stats.html
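
If compute stats cannot be run, a hedged sketch of setting those values by
hand via impyla follows; the table name, column, and numbers are only
illustrative, and the exact ALTER TABLE property names should be verified
against the stats page linked above:

    # Sketch: manually set a table's row count and a column's distinct-value count
    # when COMPUTE STATS cannot run. Host, table, column, and values are placeholders.
    from impala.dbapi import connect

    conn = connect(host="impala-coordinator.example.com", port=21050)
    cur = conn.cursor()
    # Table-level row count (see the Impala stats documentation linked above).
    cur.execute(
        "ALTER TABLE default.dv_engagementendevent_prq_local SET TBLPROPERTIES("
        "'numRows'='30000000', 'STATS_GENERATED_VIA_STATS_TASK'='true')")
    # Column-level number of distinct values and nulls.
    cur.execute(
        "ALTER TABLE default.dv_engagementendevent_prq_local "
        "SET COLUMN STATS skill_id ('numDVs'='500','numNulls'='0')")
    cur.close()
    conn.close()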




Re: Aggregate Peak Memory Usage and Per Node Peak Memory Usage

Posted by Fawze Abujaber <fa...@gmail.com>.
Hi Mostafa,

I can't run compute stats on the dv tables, and I think the reason is that
these tables' data was written using Spark SQL.


   - r-00000-5752904c-b197-48c0-8f9b-b629e727a5db.snappy.parquet' has an
   incompatible Parquet schema for column
   'default.dv_engagementendevent_prq_local.dsptlocation'. Column type:
   STRING, Parquet schema: optional struct EngagementEndEvent [i:-1 d:1 r:0] {
   optional int64 engagement_end_time [i:201 d:2 r:0] optional int64
   end_reason [i:202 d:2 r:0] optional byte_array agent_id [i:203 d:2 r:0]
   optional byte_array service_queue_id [i:204 d:2 r:0] optional int64
   skill_id [i:205 d:2 r:0] optional int32 interactive_indication [i:206 d:2
   r:0] optional int64 engagement_set_sequence [i:207 d:2 r:0] optional int64
   engagement_sequence [i:208 d:2 r:0] optional int32 old_channel [i:209 d:2
   r:0] optional int64 millis_in_queue [i:210 d:2 r:0] optional int64
   target_skill_id [i:211 d:2 r:0] optional byte_array target_agent_id [i:212
   d:2 r:0] optional byte_array transcript_identifier [i:213 d:2 r:0] optional
   int64 call_connect_time [i:214 d:2 r:0] optional int64
   conversation_start_time [i:215 d:2 r:0] optional int64
   precall_ivr_connect_time [i:216 d:2 r:0] optional int64 s_call_request_time
   [i:217 d:2 r:0] optional int64 s_call_scheduled_time [i:218 d:2 r:0]
   optional byte_array units [i:219 d:2 r:0] optional byte_array
   from_phone_number [i:220 d:2 r:0] optional byte_array to_phone_number
   [i:221 d:2 r:0] optional byte_array did_number [i:222 d:2 r:0] optional
   int32 from_participant_kind [i:223 d:2 r:0] optional int32
   to_participant_kind [i:224 d:2 r:0] optional byte_array
   technical_call_failed_reason [i:225 d:2 r:0] optional byte_array
   call_disconnector [i:226 d:2 r:0] optional int32 last_call_type [i:227 d:2
   r:0] optional int64 call_end_reason [i:228 d:2 r:0] optional byte_array
   agent_session_id [i:229 d:2 r:0] optional int32 channel [i:230 d:2 r:0]
   optional byte_array skill_name [i:231 d:2 r:0] optional byte_array
   target_skill_name [i:232 d:2 r:0] optional byte_array service_queue_name
   [i:233 d:2 r:0] optional int64 engagement_start_time [i:234 d:2 r:0]
   optional int64 interaction_start_time [i:235 d:2 r:0] optional int64
   agent_group_id [i:236 d:2 r:0] optional int64 chat_center_id [i:237 d:2
   r:0] }


The other tables aren't partitioned.
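
For what it's worth, one way to see exactly what schema Spark wrote is to
inspect one of the Parquet files directly and compare it with the external
table's column types. The sketch below assumes the pyarrow library and
uses the file name from the error above only as a placeholder path:

    # Sketch: print the Parquet schema Spark actually wrote, to compare with the
    # external table's column types. pyarrow and the path are assumptions.
    import pyarrow.parquet as pq

    pf = pq.ParquetFile("r-00000-5752904c-b197-48c0-8f9b-b629e727a5db.snappy.parquet")
    print(pf.schema)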



-- 
Take Care
Fawze Abujaber

Re: Aggregate Peak Memory Usage and Per Node Peak Memory Usage

Posted by Mostafa Mokhtar <mm...@cloudera.com>.
Please run “compute stats table_name” against the tables missing stats in the warnings section.
Then rerun the query and send the new profiles. 
With stats, the planner should produce more efficient query plans with lower peak memory. 

Thanks 
Mostafa


Re: Aggregate Peak Memory Usage and Per Node Peak Memory Usage

Posted by Fawze Abujaber <fa...@gmail.com>.
Attached the query profile



-- 
Take Care
Fawze Abujaber

Re: Aggregate Peak Memory Usage and Per Node Peak Memory Usage

Posted by Mostafa Mokhtar <mm...@cloudera.com>.
Please include the full query profile in your original email. 

Thanks 
Mostafa
