Posted to dev@impala.apache.org by 黄权隆 <hu...@gmail.com> on 2017/08/30 22:50:44 UTC

Question about the multi-thread scan node model

Hi all,


I’m working on porting our ORC support patch to the latest code base (
IMPALA-5717 <https://issues.apache.org/jira/browse/IMPALA-5717>). Since our
patch is based on cdh-5.7.3-release, which was released a year ago,
there is a lot of work involved in merging it.


One of the biggest changes since cdh-5.7.3-release that I noticed is the new
scan node & scanner model introduced in IMPALA-3902
<https://issues.apache.org/jira/browse/IMPALA-3902>. I think it was inspired
by the investigation task in IMPALA-2849
<https://issues.apache.org/jira/browse/IMPALA-2849>, but I cannot find any
performance report in that issue. Could you share any performance reports on
this multi-threading refactor?


I’m wondering how much this can improve performance, since the old model
(a single-threaded scan node with multi-threaded scanners) already provides
concurrent IO for reading, and most OLAP queries are IO bound.


Thanks,

Quanlong

Re: Question about the multi-thread scan node model

Posted by "huangquanlong@gmail.com" <hu...@gmail.com>.
Got it. Thanks Tim!

On 2017-09-01 00:53, Tim Armstrong <ta...@cloudera.com> wrote: 
> I spoke to Alex Behm off-list about that JIRA a while ago. I don't think
> it's a true ramp-up task. The code change is easy but I think we would want
> to do performance validation and testing to make sure that the new
> multithreaded scanners have similar performance and stability before making
> them the default.

Re: Question about the multi-thread scan node model

Posted by Tim Armstrong <ta...@cloudera.com>.
I spoke to Alex Behm off-list about that JIRA a while ago. I don't think
it's a true ramp-up task. The code change is easy but I think we would want
to do performance validation and testing to make sure that the new
multithreaded scanners have similar performance and stability before making
them the default.

On Thu, Aug 31, 2017 at 12:34 AM, huangquanlong@gmail.com <
huangquanlong@gmail.com> wrote:

> Yeah, "compute stats" is really CPU bound. That sounds great!
>
> I noticed that one of the subtasks of the multithreading work is labeled
> "ramp up": https://issues.apache.org/jira/browse/IMPALA-5802
> Is this in progress? If not, could you reassign it to me so I can get
> familiar with the latest framework?
>
> Thanks,
> Quanlong

Re: Question about the multi-thread scan node model

Posted by "huangquanlong@gmail.com" <hu...@gmail.com>.
Yeah, "compute stats" is really CPU bound. That sounds great!

I noticed that one of the subtasks of the multithreading work is labeled "ramp up": https://issues.apache.org/jira/browse/IMPALA-5802
Is this in progress? If not, could you reassign it to me so I can get familiar with the latest framework?

Thanks,
Quanlong

On 2017-08-31 07:16, Tim Armstrong <ta...@cloudera.com> wrote: 
> Hi,
>   The new scanner model is part of the multithreading work to support
> running multiple instances of each fragment on each Impala daemon. The idea
> there is that parallelisation is done at the fragment level so that all
> execution including aggregations, sorts, joins is parallelised - not just
> scans. This is enabled by setting mt_dop > 0. Currently it doesn't work for
> plans including joins and HDFS inserts.
> 
> We find that a lot of queries are compute bound, particularly by
> aggregations and joins. In those cases we get big speedups from the newer
> multithreading model. E.g. "compute stats" is a lot faster.

Re: Question about the multi-thread scan node model

Posted by Tim Armstrong <ta...@cloudera.com>.
Hi,
  The new scanner model is part of the multithreading work to support
running multiple instances of each fragment on each Impala daemon. The idea
there is that parallelisation is done at the fragment level so that all
execution including aggregations, sorts, joins is parallelised - not just
scans. This is enabled by setting mt_dop > 0. Currently it doesn't work for
plans including joins and HDFS inserts.

We find that a lot of queries are compute bound, particularly by
aggregations and joins. In those cases we get big speedups from the newer
multithreading model. E.g. "compute stats" is a lot faster.
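
To illustrate the distinction, here is a toy Python sketch. It is not Impala
code: SCAN_RANGES, scan, fragment_instance and dop are hypothetical names
made up for this example, and Python threads will not show a real CPU
speedup. The only point is where the aggregation runs. In the first function
the scans are concurrent but a single thread aggregates; in the second, each
instance scans and pre-aggregates its own share, and the partial results are
merged at the end.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for scan ranges: each "range" is a list of group keys.
SCAN_RANGES = [["a", "b", "a"], ["b", "c"], ["a", "c", "c"], ["b"]]


def scan(scan_range):
    """Pretend to read one scan range and return its rows."""
    return list(scan_range)


def scan_only_parallel(ranges, num_scanner_threads):
    """Scans run concurrently, but one thread aggregates every row."""
    with ThreadPoolExecutor(max_workers=num_scanner_threads) as pool:
        batches = list(pool.map(scan, ranges))
    counts = Counter()
    for batch in batches:  # single-threaded aggregation
        counts.update(batch)
    return counts


def fragment_instance(ranges):
    """One fragment instance: scans its ranges and aggregates them itself."""
    counts = Counter()
    for scan_range in ranges:
        counts.update(scan(scan_range))
    return counts


def fragment_parallel(ranges, dop):
    """`dop` instances each scan and pre-aggregate; partials are then merged."""
    partitions = [ranges[i::dop] for i in range(dop)]
    with ThreadPoolExecutor(max_workers=dop) as pool:
        partials = list(pool.map(fragment_instance, partitions))
    merged = Counter()
    for partial in partials:
        merged.update(partial)  # adds the per-key counts together
    return merged
```

Both functions return the same counts for the same input; they differ only
in how much of the work runs on more than one thread.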
