You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@drill.apache.org by rahul challapalli <ch...@gmail.com> on 2016/05/26 00:50:49 UTC

Hash Aggregate Memory usage

I am trying to understand the memory usage patterns for hash aggregate. The
below query completes in 9.163 seconds and uses 24 MB of memory for
hash-aggregate (according to profile)

select max(d.l_linenumber) from (select l_linenumber, 'asdf' c1, 'kfjhl'
c2, 'reyui' c3, 'khdfs' c4, 'vkhj' c5  from mem_heavy1) d group by d.c1,
d.c2, d.c3, d.c4, d.c5;

Adding one more constant column to the group by, the below query takes
11.638 seconds and uses 29 MB of ram

select max(d.l_linenumber) from (select l_linenumber, 'asdf' c1, 'kfjhl'
c2, 'reyui' c3, 'khdfs' c4, 'vkhj' c5, 'bmkr' c6  from mem_heavy1) d group
by d.c1, d.c2, d.c3, d.c4, d.c5, d.c6;

The below query with one more constant column added to group by 14.622
seconds and uses 33 MB memory

select max(d.l_linenumber) from (select l_linenumber, 'asdf' c1, 'kfjhl'
c2, 'reyui' c3, 'khdfs' c4, 'vkhj' c5, 'bmkr' c6, 'ciuh' c7  from
mem_heavy1) d group by d.c1, d.c2, d.c3, d.c4, d.c5, d.c6, d.c7;


As you can see, there is only one disctinct group in all the above cases.
It looks like the memory usage is proportional to no of elements in the
group by clause. Is this expected?

Is the increase in time expected between the above queries? (As we did not
introduce any new groups)

- Rahul

Re: Hash Aggregate Memory usage

Posted by rahul challapalli <ch...@gmail.com>.

Thanks for the information Jacques.

Based on the above formula from Jacques, the hash agg operator should not
be using ~33MB memory when we have only 1 group ( and 7 varchar columns in
the group). As per Aman's suggestion, I tried using fixed width columns in
the group by and the memory usage went down. The below query returns in
7.071 seconds and consumes 4MB of peak memory (based on profile). Even 4MB
seems to be quite large based on the above formula.

- Rahul

On Fri, May 27, 2016 at 11:44 AM, Jacques Nadeau <ja...@dremio.com> wrote:

> There was a presentation a year or so ago I presented at the MapR sales
> kickoff that covers the memory characteristics of operators. Unfortunately,
> I don't have access to the content but hopefully someone internal to MapR
> should have it. (Maybe Ellen or Neeraja)
>
> Approximately (from memory):
>
> total hash aggregate size = entries * (links (4 bytes) + hash code (4
> bytes) + aggregate key size + aggregate workspace variable size)
> aggregate key size = (fixed value size of all keys + variable value size
> for all keys)
> fixed value size = fixed width size (e.g. 4 bytes for a four byte int) +
> nullability (1 or 0 bytes)
> variable value size = offset (4 bytes)  + length of data  + nullability (1
> or 0 bytes)
> aggregate workspace variable size = each function field *  value size
>
> Note that the entries is actually based on the nearest power of two.
> Additionally, every vector is also rounded up to the nearest power of two.
> (this includes both the key vectors, workspace vectors links and hash code
> vectors
>
> --
> Jacques Nadeau
> CTO and Co-Founder, Dremio
>
> On Fri, May 27, 2016 at 11:21 AM, Aman Sinha <am...@apache.org> wrote:
>
> > Rahul,  can you send me the query profile separately ?  Also, can you try
> > group-by on fixed-width columns instead of Varchar ?
> > With single group, the hash table itself should be consuming relatively
> > small amount of memory.
> >
> > On Fri, May 27, 2016 at 11:14 AM, Zelaine Fong <zf...@maprtech.com>
> wrote:
> >
> > > My guess would be that for hashing, a hash table is pre-allocated based
> > on
> > > the number of keys in the hash.  That would explain why with more keys,
> > the
> > > memory usage grows.  But that's just my guess.  Someone who really
> > > understands how this works should chime in :).
> > >
> > > -- Zelaine
> > >
> > > On Fri, May 27, 2016 at 10:36 AM, rahul challapalli <
> > > challapallirahul@gmail.com> wrote:
> > >
> > > > Any inputs on this one?
> > > >
> > > > On Wed, May 25, 2016 at 7:51 PM, rahul challapalli <
> > > > challapallirahul@gmail.com> wrote:
> > > >
> > > > > Its using hash aggregation.
> > > > > On May 25, 2016 7:48 PM, "Zelaine Fong" <zf...@maprtech.com>
> wrote:
> > > > >
> > > > >> What does the explain plan show?  I.e., is the group by being done
> > > via a
> > > > >> hash agg or a streaming agg?  If it's a streaming agg, then you
> > still
> > > > have
> > > > >> to sort the entire data set before you reduce it down to a single
> > > group.
> > > > >> That would explain the increase in memory as you add group by
> keys.
> > > > >>
> > > > >> -- Zelaine
> > > > >>
> > > > >> On Wed, May 25, 2016 at 5:50 PM, rahul challapalli <
> > > > >> challapallirahul@gmail.com> wrote:
> > > > >>
> > > > >> > I am trying to understand the memory usage patterns for hash
> > > > aggregate.
> > > > >> The
> > > > >> > below query completes in 9.163 seconds and uses 24 MB of memory
> > for
> > > > >> > hash-aggregate (according to profile)
> > > > >> >
> > > > >> > select max(d.l_linenumber) from (select l_linenumber, 'asdf' c1,
> > > > 'kfjhl'
> > > > >> > c2, 'reyui' c3, 'khdfs' c4, 'vkhj' c5  from mem_heavy1) d group
> by
> > > > d.c1,
> > > > >> > d.c2, d.c3, d.c4, d.c5;
> > > > >> >
> > > > >> > Adding one more constant column to the group by, the below query
> > > takes
> > > > >> > 11.638 seconds and uses 29 MB of ram
> > > > >> >
> > > > >> > select max(d.l_linenumber) from (select l_linenumber, 'asdf' c1,
> > > > 'kfjhl'
> > > > >> > c2, 'reyui' c3, 'khdfs' c4, 'vkhj' c5, 'bmkr' c6  from
> > mem_heavy1) d
> > > > >> group
> > > > >> > by d.c1, d.c2, d.c3, d.c4, d.c5, d.c6;
> > > > >> >
> > > > >> > The below query with one more constant column added to group by
> > > 14.622
> > > > >> > seconds and uses 33 MB memory
> > > > >> >
> > > > >> > select max(d.l_linenumber) from (select l_linenumber, 'asdf' c1,
> > > > 'kfjhl'
> > > > >> > c2, 'reyui' c3, 'khdfs' c4, 'vkhj' c5, 'bmkr' c6, 'ciuh' c7
> from
> > > > >> > mem_heavy1) d group by d.c1, d.c2, d.c3, d.c4, d.c5, d.c6, d.c7;
> > > > >> >
> > > > >> >
> > > > >> > As you can see, there is only one disctinct group in all the
> above
> > > > >> cases.
> > > > >> > It looks like the memory usage is proportional to no of elements
> > in
> > > > the
> > > > >> > group by clause. Is this expected?
> > > > >> >
> > > > >> > Is the increase in time expected between the above queries? (As
> we
> > > did
> > > > >> not
> > > > >> > introduce any new groups)
> > > > >> >
> > > > >> > - Rahul
> > > > >> >
> > > > >>
> > > > >
> > > >
> > >
> >
>

Re: Hash Aggregate Memory usage

Posted by Jacques Nadeau <ja...@dremio.com>.

There was a presentation a year or so ago I presented at the MapR sales
kickoff that covers the memory characteristics of operators. Unfortunately,
I don't have access to the content but hopefully someone internal to MapR
should have it. (Maybe Ellen or Neeraja)

Approximately (from memory):

total hash aggregate size = entries * (links (4 bytes) + hash code (4
bytes) + aggregate key size + aggregate workspace variable size)
aggregate key size = (fixed value size of all keys + variable value size
for all keys)
fixed value size = fixed width size (e.g. 4 bytes for a four byte int) +
nullability (1 or 0 bytes)
variable value size = offset (4 bytes)  + length of data  + nullability (1
or 0 bytes)
aggregate workspace variable size = each function field *  value size

Note that the entries is actually based on the nearest power of two.
Additionally, every vector is also rounded up to the nearest power of two.
(this includes both the key vectors, workspace vectors links and hash code
vectors

--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Fri, May 27, 2016 at 11:21 AM, Aman Sinha <am...@apache.org> wrote:

> Rahul,  can you send me the query profile separately ?  Also, can you try
> group-by on fixed-width columns instead of Varchar ?
> With single group, the hash table itself should be consuming relatively
> small amount of memory.
>
> On Fri, May 27, 2016 at 11:14 AM, Zelaine Fong <zf...@maprtech.com> wrote:
>
> > My guess would be that for hashing, a hash table is pre-allocated based
> on
> > the number of keys in the hash.  That would explain why with more keys,
> the
> > memory usage grows.  But that's just my guess.  Someone who really
> > understands how this works should chime in :).
> >
> > -- Zelaine
> >
> > On Fri, May 27, 2016 at 10:36 AM, rahul challapalli <
> > challapallirahul@gmail.com> wrote:
> >
> > > Any inputs on this one?
> > >
> > > On Wed, May 25, 2016 at 7:51 PM, rahul challapalli <
> > > challapallirahul@gmail.com> wrote:
> > >
> > > > Its using hash aggregation.
> > > > On May 25, 2016 7:48 PM, "Zelaine Fong" <zf...@maprtech.com> wrote:
> > > >
> > > >> What does the explain plan show?  I.e., is the group by being done
> > via a
> > > >> hash agg or a streaming agg?  If it's a streaming agg, then you
> still
> > > have
> > > >> to sort the entire data set before you reduce it down to a single
> > group.
> > > >> That would explain the increase in memory as you add group by keys.
> > > >>
> > > >> -- Zelaine
> > > >>
> > > >> On Wed, May 25, 2016 at 5:50 PM, rahul challapalli <
> > > >> challapallirahul@gmail.com> wrote:
> > > >>
> > > >> > I am trying to understand the memory usage patterns for hash
> > > aggregate.
> > > >> The
> > > >> > below query completes in 9.163 seconds and uses 24 MB of memory
> for
> > > >> > hash-aggregate (according to profile)
> > > >> >
> > > >> > select max(d.l_linenumber) from (select l_linenumber, 'asdf' c1,
> > > 'kfjhl'
> > > >> > c2, 'reyui' c3, 'khdfs' c4, 'vkhj' c5  from mem_heavy1) d group by
> > > d.c1,
> > > >> > d.c2, d.c3, d.c4, d.c5;
> > > >> >
> > > >> > Adding one more constant column to the group by, the below query
> > takes
> > > >> > 11.638 seconds and uses 29 MB of ram
> > > >> >
> > > >> > select max(d.l_linenumber) from (select l_linenumber, 'asdf' c1,
> > > 'kfjhl'
> > > >> > c2, 'reyui' c3, 'khdfs' c4, 'vkhj' c5, 'bmkr' c6  from
> mem_heavy1) d
> > > >> group
> > > >> > by d.c1, d.c2, d.c3, d.c4, d.c5, d.c6;
> > > >> >
> > > >> > The below query with one more constant column added to group by
> > 14.622
> > > >> > seconds and uses 33 MB memory
> > > >> >
> > > >> > select max(d.l_linenumber) from (select l_linenumber, 'asdf' c1,
> > > 'kfjhl'
> > > >> > c2, 'reyui' c3, 'khdfs' c4, 'vkhj' c5, 'bmkr' c6, 'ciuh' c7  from
> > > >> > mem_heavy1) d group by d.c1, d.c2, d.c3, d.c4, d.c5, d.c6, d.c7;
> > > >> >
> > > >> >
> > > >> > As you can see, there is only one disctinct group in all the above
> > > >> cases.
> > > >> > It looks like the memory usage is proportional to no of elements
> in
> > > the
> > > >> > group by clause. Is this expected?
> > > >> >
> > > >> > Is the increase in time expected between the above queries? (As we
> > did
> > > >> not
> > > >> > introduce any new groups)
> > > >> >
> > > >> > - Rahul
> > > >> >
> > > >>
> > > >
> > >
> >
>

Re: Hash Aggregate Memory usage

Posted by Aman Sinha <am...@apache.org>.

Rahul,  can you send me the query profile separately ?  Also, can you try
group-by on fixed-width columns instead of Varchar ?
With single group, the hash table itself should be consuming relatively
small amount of memory.

On Fri, May 27, 2016 at 11:14 AM, Zelaine Fong <zf...@maprtech.com> wrote:

> My guess would be that for hashing, a hash table is pre-allocated based on
> the number of keys in the hash.  That would explain why with more keys, the
> memory usage grows.  But that's just my guess.  Someone who really
> understands how this works should chime in :).
>
> -- Zelaine
>
> On Fri, May 27, 2016 at 10:36 AM, rahul challapalli <
> challapallirahul@gmail.com> wrote:
>
> > Any inputs on this one?
> >
> > On Wed, May 25, 2016 at 7:51 PM, rahul challapalli <
> > challapallirahul@gmail.com> wrote:
> >
> > > Its using hash aggregation.
> > > On May 25, 2016 7:48 PM, "Zelaine Fong" <zf...@maprtech.com> wrote:
> > >
> > >> What does the explain plan show?  I.e., is the group by being done
> via a
> > >> hash agg or a streaming agg?  If it's a streaming agg, then you still
> > have
> > >> to sort the entire data set before you reduce it down to a single
> group.
> > >> That would explain the increase in memory as you add group by keys.
> > >>
> > >> -- Zelaine
> > >>
> > >> On Wed, May 25, 2016 at 5:50 PM, rahul challapalli <
> > >> challapallirahul@gmail.com> wrote:
> > >>
> > >> > I am trying to understand the memory usage patterns for hash
> > aggregate.
> > >> The
> > >> > below query completes in 9.163 seconds and uses 24 MB of memory for
> > >> > hash-aggregate (according to profile)
> > >> >
> > >> > select max(d.l_linenumber) from (select l_linenumber, 'asdf' c1,
> > 'kfjhl'
> > >> > c2, 'reyui' c3, 'khdfs' c4, 'vkhj' c5  from mem_heavy1) d group by
> > d.c1,
> > >> > d.c2, d.c3, d.c4, d.c5;
> > >> >
> > >> > Adding one more constant column to the group by, the below query
> takes
> > >> > 11.638 seconds and uses 29 MB of ram
> > >> >
> > >> > select max(d.l_linenumber) from (select l_linenumber, 'asdf' c1,
> > 'kfjhl'
> > >> > c2, 'reyui' c3, 'khdfs' c4, 'vkhj' c5, 'bmkr' c6  from mem_heavy1) d
> > >> group
> > >> > by d.c1, d.c2, d.c3, d.c4, d.c5, d.c6;
> > >> >
> > >> > The below query with one more constant column added to group by
> 14.622
> > >> > seconds and uses 33 MB memory
> > >> >
> > >> > select max(d.l_linenumber) from (select l_linenumber, 'asdf' c1,
> > 'kfjhl'
> > >> > c2, 'reyui' c3, 'khdfs' c4, 'vkhj' c5, 'bmkr' c6, 'ciuh' c7  from
> > >> > mem_heavy1) d group by d.c1, d.c2, d.c3, d.c4, d.c5, d.c6, d.c7;
> > >> >
> > >> >
> > >> > As you can see, there is only one disctinct group in all the above
> > >> cases.
> > >> > It looks like the memory usage is proportional to no of elements in
> > the
> > >> > group by clause. Is this expected?
> > >> >
> > >> > Is the increase in time expected between the above queries? (As we
> did
> > >> not
> > >> > introduce any new groups)
> > >> >
> > >> > - Rahul
> > >> >
> > >>
> > >
> >
>

Re: Hash Aggregate Memory usage

Posted by Zelaine Fong <zf...@maprtech.com>.

My guess would be that for hashing, a hash table is pre-allocated based on
the number of keys in the hash.  That would explain why with more keys, the
memory usage grows.  But that's just my guess.  Someone who really
understands how this works should chime in :).

-- Zelaine

On Fri, May 27, 2016 at 10:36 AM, rahul challapalli <
challapallirahul@gmail.com> wrote:

> Any inputs on this one?
>
> On Wed, May 25, 2016 at 7:51 PM, rahul challapalli <
> challapallirahul@gmail.com> wrote:
>
> > Its using hash aggregation.
> > On May 25, 2016 7:48 PM, "Zelaine Fong" <zf...@maprtech.com> wrote:
> >
> >> What does the explain plan show?  I.e., is the group by being done via a
> >> hash agg or a streaming agg?  If it's a streaming agg, then you still
> have
> >> to sort the entire data set before you reduce it down to a single group.
> >> That would explain the increase in memory as you add group by keys.
> >>
> >> -- Zelaine
> >>
> >> On Wed, May 25, 2016 at 5:50 PM, rahul challapalli <
> >> challapallirahul@gmail.com> wrote:
> >>
> >> > I am trying to understand the memory usage patterns for hash
> aggregate.
> >> The
> >> > below query completes in 9.163 seconds and uses 24 MB of memory for
> >> > hash-aggregate (according to profile)
> >> >
> >> > select max(d.l_linenumber) from (select l_linenumber, 'asdf' c1,
> 'kfjhl'
> >> > c2, 'reyui' c3, 'khdfs' c4, 'vkhj' c5  from mem_heavy1) d group by
> d.c1,
> >> > d.c2, d.c3, d.c4, d.c5;
> >> >
> >> > Adding one more constant column to the group by, the below query takes
> >> > 11.638 seconds and uses 29 MB of ram
> >> >
> >> > select max(d.l_linenumber) from (select l_linenumber, 'asdf' c1,
> 'kfjhl'
> >> > c2, 'reyui' c3, 'khdfs' c4, 'vkhj' c5, 'bmkr' c6  from mem_heavy1) d
> >> group
> >> > by d.c1, d.c2, d.c3, d.c4, d.c5, d.c6;
> >> >
> >> > The below query with one more constant column added to group by 14.622
> >> > seconds and uses 33 MB memory
> >> >
> >> > select max(d.l_linenumber) from (select l_linenumber, 'asdf' c1,
> 'kfjhl'
> >> > c2, 'reyui' c3, 'khdfs' c4, 'vkhj' c5, 'bmkr' c6, 'ciuh' c7  from
> >> > mem_heavy1) d group by d.c1, d.c2, d.c3, d.c4, d.c5, d.c6, d.c7;
> >> >
> >> >
> >> > As you can see, there is only one disctinct group in all the above
> >> cases.
> >> > It looks like the memory usage is proportional to no of elements in
> the
> >> > group by clause. Is this expected?
> >> >
> >> > Is the increase in time expected between the above queries? (As we did
> >> not
> >> > introduce any new groups)
> >> >
> >> > - Rahul
> >> >
> >>
> >
>

Re: Hash Aggregate Memory usage

Posted by rahul challapalli <ch...@gmail.com>.

Any inputs on this one?

On Wed, May 25, 2016 at 7:51 PM, rahul challapalli <
challapallirahul@gmail.com> wrote:

> Its using hash aggregation.
> On May 25, 2016 7:48 PM, "Zelaine Fong" <zf...@maprtech.com> wrote:
>
>> What does the explain plan show?  I.e., is the group by being done via a
>> hash agg or a streaming agg?  If it's a streaming agg, then you still have
>> to sort the entire data set before you reduce it down to a single group.
>> That would explain the increase in memory as you add group by keys.
>>
>> -- Zelaine
>>
>> On Wed, May 25, 2016 at 5:50 PM, rahul challapalli <
>> challapallirahul@gmail.com> wrote:
>>
>> > I am trying to understand the memory usage patterns for hash aggregate.
>> The
>> > below query completes in 9.163 seconds and uses 24 MB of memory for
>> > hash-aggregate (according to profile)
>> >
>> > select max(d.l_linenumber) from (select l_linenumber, 'asdf' c1, 'kfjhl'
>> > c2, 'reyui' c3, 'khdfs' c4, 'vkhj' c5  from mem_heavy1) d group by d.c1,
>> > d.c2, d.c3, d.c4, d.c5;
>> >
>> > Adding one more constant column to the group by, the below query takes
>> > 11.638 seconds and uses 29 MB of ram
>> >
>> > select max(d.l_linenumber) from (select l_linenumber, 'asdf' c1, 'kfjhl'
>> > c2, 'reyui' c3, 'khdfs' c4, 'vkhj' c5, 'bmkr' c6  from mem_heavy1) d
>> group
>> > by d.c1, d.c2, d.c3, d.c4, d.c5, d.c6;
>> >
>> > The below query with one more constant column added to group by 14.622
>> > seconds and uses 33 MB memory
>> >
>> > select max(d.l_linenumber) from (select l_linenumber, 'asdf' c1, 'kfjhl'
>> > c2, 'reyui' c3, 'khdfs' c4, 'vkhj' c5, 'bmkr' c6, 'ciuh' c7  from
>> > mem_heavy1) d group by d.c1, d.c2, d.c3, d.c4, d.c5, d.c6, d.c7;
>> >
>> >
>> > As you can see, there is only one disctinct group in all the above
>> cases.
>> > It looks like the memory usage is proportional to no of elements in the
>> > group by clause. Is this expected?
>> >
>> > Is the increase in time expected between the above queries? (As we did
>> not
>> > introduce any new groups)
>> >
>> > - Rahul
>> >
>>
>

Re: Hash Aggregate Memory usage

Posted by rahul challapalli <ch...@gmail.com>.

Its using hash aggregation.
On May 25, 2016 7:48 PM, "Zelaine Fong" <zf...@maprtech.com> wrote:

> What does the explain plan show?  I.e., is the group by being done via a
> hash agg or a streaming agg?  If it's a streaming agg, then you still have
> to sort the entire data set before you reduce it down to a single group.
> That would explain the increase in memory as you add group by keys.
>
> -- Zelaine
>
> On Wed, May 25, 2016 at 5:50 PM, rahul challapalli <
> challapallirahul@gmail.com> wrote:
>
> > I am trying to understand the memory usage patterns for hash aggregate.
> The
> > below query completes in 9.163 seconds and uses 24 MB of memory for
> > hash-aggregate (according to profile)
> >
> > select max(d.l_linenumber) from (select l_linenumber, 'asdf' c1, 'kfjhl'
> > c2, 'reyui' c3, 'khdfs' c4, 'vkhj' c5  from mem_heavy1) d group by d.c1,
> > d.c2, d.c3, d.c4, d.c5;
> >
> > Adding one more constant column to the group by, the below query takes
> > 11.638 seconds and uses 29 MB of ram
> >
> > select max(d.l_linenumber) from (select l_linenumber, 'asdf' c1, 'kfjhl'
> > c2, 'reyui' c3, 'khdfs' c4, 'vkhj' c5, 'bmkr' c6  from mem_heavy1) d
> group
> > by d.c1, d.c2, d.c3, d.c4, d.c5, d.c6;
> >
> > The below query with one more constant column added to group by 14.622
> > seconds and uses 33 MB memory
> >
> > select max(d.l_linenumber) from (select l_linenumber, 'asdf' c1, 'kfjhl'
> > c2, 'reyui' c3, 'khdfs' c4, 'vkhj' c5, 'bmkr' c6, 'ciuh' c7  from
> > mem_heavy1) d group by d.c1, d.c2, d.c3, d.c4, d.c5, d.c6, d.c7;
> >
> >
> > As you can see, there is only one disctinct group in all the above cases.
> > It looks like the memory usage is proportional to no of elements in the
> > group by clause. Is this expected?
> >
> > Is the increase in time expected between the above queries? (As we did
> not
> > introduce any new groups)
> >
> > - Rahul
> >
>

Re: Hash Aggregate Memory usage

Posted by Zelaine Fong <zf...@maprtech.com>.

Oops, my bad.  I just noticed you did indicate that the query plan shows
usage of a hash agg.

-- Zelaine

On Wed, May 25, 2016 at 7:48 PM, Zelaine Fong <zf...@maprtech.com> wrote:

> What does the explain plan show?  I.e., is the group by being done via a
> hash agg or a streaming agg?  If it's a streaming agg, then you still have
> to sort the entire data set before you reduce it down to a single group.
> That would explain the increase in memory as you add group by keys.
>
> -- Zelaine
>
> On Wed, May 25, 2016 at 5:50 PM, rahul challapalli <
> challapallirahul@gmail.com> wrote:
>
>> I am trying to understand the memory usage patterns for hash aggregate.
>> The
>> below query completes in 9.163 seconds and uses 24 MB of memory for
>> hash-aggregate (according to profile)
>>
>> select max(d.l_linenumber) from (select l_linenumber, 'asdf' c1, 'kfjhl'
>> c2, 'reyui' c3, 'khdfs' c4, 'vkhj' c5  from mem_heavy1) d group by d.c1,
>> d.c2, d.c3, d.c4, d.c5;
>>
>> Adding one more constant column to the group by, the below query takes
>> 11.638 seconds and uses 29 MB of ram
>>
>> select max(d.l_linenumber) from (select l_linenumber, 'asdf' c1, 'kfjhl'
>> c2, 'reyui' c3, 'khdfs' c4, 'vkhj' c5, 'bmkr' c6  from mem_heavy1) d group
>> by d.c1, d.c2, d.c3, d.c4, d.c5, d.c6;
>>
>> The below query with one more constant column added to group by 14.622
>> seconds and uses 33 MB memory
>>
>> select max(d.l_linenumber) from (select l_linenumber, 'asdf' c1, 'kfjhl'
>> c2, 'reyui' c3, 'khdfs' c4, 'vkhj' c5, 'bmkr' c6, 'ciuh' c7  from
>> mem_heavy1) d group by d.c1, d.c2, d.c3, d.c4, d.c5, d.c6, d.c7;
>>
>>
>> As you can see, there is only one disctinct group in all the above cases.
>> It looks like the memory usage is proportional to no of elements in the
>> group by clause. Is this expected?
>>
>> Is the increase in time expected between the above queries? (As we did not
>> introduce any new groups)
>>
>> - Rahul
>>
>
>

Re: Hash Aggregate Memory usage

Posted by Zelaine Fong <zf...@maprtech.com>.

What does the explain plan show?  I.e., is the group by being done via a
hash agg or a streaming agg?  If it's a streaming agg, then you still have
to sort the entire data set before you reduce it down to a single group.
That would explain the increase in memory as you add group by keys.

-- Zelaine

On Wed, May 25, 2016 at 5:50 PM, rahul challapalli <
challapallirahul@gmail.com> wrote:

> I am trying to understand the memory usage patterns for hash aggregate. The
> below query completes in 9.163 seconds and uses 24 MB of memory for
> hash-aggregate (according to profile)
>
> select max(d.l_linenumber) from (select l_linenumber, 'asdf' c1, 'kfjhl'
> c2, 'reyui' c3, 'khdfs' c4, 'vkhj' c5  from mem_heavy1) d group by d.c1,
> d.c2, d.c3, d.c4, d.c5;
>
> Adding one more constant column to the group by, the below query takes
> 11.638 seconds and uses 29 MB of ram
>
> select max(d.l_linenumber) from (select l_linenumber, 'asdf' c1, 'kfjhl'
> c2, 'reyui' c3, 'khdfs' c4, 'vkhj' c5, 'bmkr' c6  from mem_heavy1) d group
> by d.c1, d.c2, d.c3, d.c4, d.c5, d.c6;
>
> The below query with one more constant column added to group by 14.622
> seconds and uses 33 MB memory
>
> select max(d.l_linenumber) from (select l_linenumber, 'asdf' c1, 'kfjhl'
> c2, 'reyui' c3, 'khdfs' c4, 'vkhj' c5, 'bmkr' c6, 'ciuh' c7  from
> mem_heavy1) d group by d.c1, d.c2, d.c3, d.c4, d.c5, d.c6, d.c7;
>
>
> As you can see, there is only one disctinct group in all the above cases.
> It looks like the memory usage is proportional to no of elements in the
> group by clause. Is this expected?
>
> Is the increase in time expected between the above queries? (As we did not
> introduce any new groups)
>
> - Rahul
>