Posted to dev@hawq.apache.org by Jon Roberts <jr...@pivotal.io> on 2016/12/06 17:49:26 UTC

hawq_rm_nvseg_perquery_perseg_limit

I've been testing TPC-DS queries and found that I can get Randomly
Distributed tables to outperform Hash Distributed tables by increasing
hawq_rm_nvseg_perquery_perseg_limit on a per-query basis to as high as 24.
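For reference, this kind of per-query tuning can be done at the session level
with SET/RESET; a minimal sketch (the GUC name is from this thread, the table
and query are hypothetical):

```sql
-- Raise the per-segment vseg limit for this session only
-- (24 is the highest value from the experiments above; tune per query).
SET hawq_rm_nvseg_perquery_perseg_limit = 24;

-- hypothetical TPC-DS-style query on a randomly distributed table
SELECT count(*) FROM store_sales_random;

-- restore the server default afterwards
RESET hawq_rm_nvseg_perquery_perseg_limit;
```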

For Hash Distributed tables, 24 is way too high.  It is also not a great
idea to make the default so high, since users may be creating a mix of
Randomly and Hash Distributed tables.

Would it be possible to split this single GUC into two, so that it can stay
at 6 for Hash Distributed tables but be set to another value, like 16, for
Randomly Distributed tables?

This enhancement would also open the door to later optimizer improvements
that determine how many vsegs to use.  For example, some queries worked best
with the limit set to 12, while others benefited greatly at 24.


Jon Roberts

Re: hawq_rm_nvseg_perquery_perseg_limit

Posted by Hubert Zhang <hz...@pivotal.io>.
Hi Jon,
    You mentioned that by increasing the GUC
hawq_rm_nvseg_perquery_perseg_limit to 12 or 24, performance improves. That's
true, because we use more processes/resources to finish the query. But the
improvement is not linear, because of process init/communication cost.
Moreover, total resources are limited: if one query uses too many resources,
it reduces the concurrency of the system. The default of 6 was chosen with
this balance between performance and concurrency in mind.




-- 
Thanks

Hubert Zhang

Re: hawq_rm_nvseg_perquery_perseg_limit

Posted by Yi Jin <yj...@pivotal.io>.
I checked the code for describing tables with the \d and \d+ commands. In the
current version we cannot get the bucket number through the \d command alone.
Maybe we can improve this in the future. It should be easy to add, because
describe issues SQL statements to query the expected information.



Re: hawq_rm_nvseg_perquery_perseg_limit

Posted by Jon Roberts <jr...@pivotal.io>.
Oh, I missed that there is default_hash_table_bucket_number, which is
calculated at init time as 6 * the number of nodes and is separate from
hawq_rm_nvseg_perquery_perseg_limit, used for random and external tables.
So it is already separate.

Any chance we can get the bucketnum from gp_distribution_policy displayed
in psql when a user uses "\d" to describe the table?
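In the meantime, the bucket number can be queried from the catalog directly;
a sketch, assuming gp_distribution_policy carries localoid and bucketnum
columns as referenced above:

```sql
-- bucket number per table, straight from the distribution-policy catalog
-- (column names assumed from this thread, not verified against every version)
SELECT localoid::regclass AS table_name,
       bucketnum
FROM   gp_distribution_policy;
```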



Jon Roberts
Principal Engineer | jroberts@pivotal.io | 615-426-8661


Re: hawq_rm_nvseg_perquery_perseg_limit

Posted by Yi Jin <yj...@pivotal.io>.
Hi Jon,

To me it is good news that we can increase
hawq_rm_nvseg_perquery_perseg_limit to improve performance when accessing
randomly distributed tables. This limit is just an upper limit for random
tables. In my opinion, it is not active for a hash distributed table, and it
is not considered when deciding the bucket number of a hash table.

So, even when it is set as high as 24, a hash table always follows its bucket
number to acquire virtual segments. I think Hubert (hzhang@pivotal.io) can
give you more information on how the bucket number of a hash distributed
table is decided, and how the number of virtual segments is decided for a
query accessing tables with mixed distributions.

I also want to mention another case: under tight resources or a busy
workload, a query on a randomly distributed table will not get as stable a
number of virtual segments as a hash table does, even when
hawq_rm_nvseg_perquery_perseg_limit is set as high as 24.
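On the bucket-number point above, a sketch of how a hash table's bucket
number can be pinned at creation time (using the WITH (bucketnum=...) option;
the table and column names here are hypothetical):

```sql
-- hypothetical hash distributed table with an explicit bucket number,
-- overriding default_hash_table_bucket_number for this table only
CREATE TABLE sales_by_id (
    id  int,
    amt numeric
)
WITH (bucketnum = 24)
DISTRIBUTED BY (id);
```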

Best,
Yi
