You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@tajo.apache.org by Atri Sharma <at...@gmail.com> on 2015/06/15 16:32:43 UTC

Parallel Aggregates

Folks,

I am looking into parallel aggregates/combining aggregates. I have a plan
around it which I think can work.

Please update me on current infrastructure and point me around the existing
code base. Also, ideas would be most welcome around it.

-- 
Regards,

Atri
*l'apprenant*

Re: Parallel Aggregates

Posted by Atri Sharma <at...@gmail.com>.

Thanks!
On 18 Jun 2015 21:24, "Jaehwa Jung" <bl...@apache.org> wrote:

> Hi Atri
>
> I think that following articles would be helpful for you to understand tajo
> architecture.
>
> - Tajo query execution and scheduling sequence : http://jaso.co.kr/501
> - Tajo stage flow and StorageManager: http://jaso.co.kr/503
>
> For the reference, above articles had been written by Hyoungjun Kim who is
> Tajo PMC. :)
>
> Cheers
> Jaehwa
>
> 2015-06-18 20:21 GMT+09:00 Jihoon Son <ji...@apache.org>:
>
> > It seems that there aren't ongoing issues for distinct aggregation.
> >
> > 2015년 6월 18일 (목) 오전 10:50, Atri Sharma <at...@gmail.com>님이 작성:
> >
> > > And for DISTINCT issues?
> > > On 18 Jun 2015 15:16, "Jihoon Son" <gh...@gmail.com> wrote:
> > >
> > > > As far as I know, TAJO-256, TAJO-259, and TAJO-1420 are issues for
> data
> > > > cube and grouping sets.
> > > > You can create any issues if you want. Even though some issues can be
> > > > duplicated, it's ok.
> > > > 2015년 6월 18일 (목) 오전 10:39, Atri Sharma <at...@gmail.com>님이 작성:
> > > >
> > > > > Do we have a ticket around that?
> > > > > On 18 Jun 2015 15:07, "Jihoon Son" <ji...@apache.org> wrote:
> > > > >
> > > > > > It looks good to start.
> > > > > > Any questions welcome!
> > > > > >
> > > > > > Jihoon
> > > > > >
> > > > > > 2015년 6월 18일 (목) 오전 3:39, Atri Sharma <at...@gmail.com>님이
> 작성:
> > > > > >
> > > > > > > So distinct aggregation is one area, thanks.
> > > > > > >
> > > > > > > I am trying to get enough knowledge of Internals of aggregation
> > > > engine
> > > > > > and
> > > > > > > query planner to be able to work on rollup and cube so picking
> > > > smaller
> > > > > > > tickets first.
> > > > > > > On 18 Jun 2015 02:42, "Jihoon Son" <ji...@apache.org>
> wrote:
> > > > > > >
> > > > > > > > As far as I know, there aren't any plans for improvement
> except
> > > in
> > > > > > > distinct
> > > > > > > > aggregation. I think that our code for distinct aggregation
> is
> > > too
> > > > > > > > complicated, and the performance also should be improved.
> > > > > > > >
> > > > > > > > So, when you design the implementation of your algorithm on
> > Tajo,
> > > > you
> > > > > > > don't
> > > > > > > > have to consider distinct aggregation part, I think.
> > > > > > > >
> > > > > > > > 2015년 6월 18일 (목) 오전 2:16, Atri Sharma <atri.jiit@gmail.com
> >님이
> > > 작성:
> > > > > > > >
> > > > > > > > > Thank you.
> > > > > > > > >
> > > > > > > > > Is there any improvement in aggregates that we are looking
> at
> > > > > please?
> > > > > > > > > On 16 Jun 2015 17:07, "Jihoon Son" <ji...@apache.org>
> > > wrote:
> > > > > > > > >
> > > > > > > > > > In Tajo, aggregation is very similar to that in Hadoop
> > > > MapReduce.
> > > > > > > > > > Let me consider an example. Given a query of "select *k*,
> > > > > count(*)
> > > > > > > from
> > > > > > > > > *t*
> > > > > > > > > > group by *k*", Tajo generates a LogicalPlan as follows.
> > > > > > > > > >
> > > > > > > > > > group by (k)
> > > > > > > > > >        |
> > > > > > > > > >    scan (t)
> > > > > > > > > >
> > > > > > > > > > This LogicalPlan is translated into a MasterPlan as
> > follows.
> > > > > > > > > >
> > > > > > > > > > -----------------
> > > > > > > > > >      Stage2
> > > > > > > > > >   group by *k*
> > > > > > > > > > -----------------
> > > > > > > > > >           |
> > > > > > > > > > shuffle tuples with *k*
> > > > > > > > > >           |
> > > > > > > > > > -----------------
> > > > > > > > > >      Stage1
> > > > > > > > > >   group by *k*
> > > > > > > > > >          |
> > > > > > > > > >     scan *t*
> > > > > > > > > > -----------------
> > > > > > > > > >
> > > > > > > > > > As you can see in this example, the query plan consists
> of
> > 2
> > > > > > stages.
> > > > > > > > Each
> > > > > > > > > > stage is executed subsequently because the result of
> Stage
> > 1
> > > is
> > > > > > used
> > > > > > > as
> > > > > > > > > the
> > > > > > > > > > input of Stage 2. Each stage is divided into multiple
> tasks
> > > for
> > > > > > each
> > > > > > > > > input
> > > > > > > > > > split as follows.
> > > > > > > > > >
> > > > > > > > > > Stage1
> > > > > > > > > >
> > > > > > > > > > Task 1
> > > > > > > > > > group by *k*
> > > > > > > > > >        |
> > > > > > > > > >   scan *t* (0 - 99)
> > > > > > > > > >
> > > > > > > > > > Task 2
> > > > > > > > > > group by *k*
> > > > > > > > > >        |
> > > > > > > > > >   scan *t* (100 - 199)
> > > > > > > > > > ...
> > > > > > > > > >
> > > > > > > > > > Each task is executed by a TajoWorker. As you can see,
> > tasks
> > > of
> > > > > the
> > > > > > > > first
> > > > > > > > > > stage execute a local aggregation after scanning input
> > split.
> > > > > This
> > > > > > > > local
> > > > > > > > > > aggregation result is shuffled among TajoWorkers with the
> > > > > > aggregation
> > > > > > > > key
> > > > > > > > > > *k*. Then, the final aggregation is computed at the
> second
> > > > stage.
> > > > > > > > > >
> > > > > > > > > > Stage1 and Stage2 are similar to Map and Reduce of
> > MapReduce.
> > > > The
> > > > > > > local
> > > > > > > > > > aggregation of Stage1 is similar to the Combiner of
> Hadoop
> > > > > > MapReduce.
> > > > > > > > > >
> > > > > > > > > > I hope that this will be helpful to you.
> > > > > > > > > > If you have any further questions, please feel free to
> ask.
> > > > > > > > > > Jihoon
> > > > > > > > > >
> > > > > > > > > > 2015년 6월 16일 (화) 오전 7:28, Atri Sharma <
> atri.jiit@gmail.com
> > > >님이
> > > > > 작성:
> > > > > > > > > >
> > > > > > > > > > Thanks.
> > > > > > > > > > >
> > > > > > > > > > > What are your thoughts on parallel aggregation?
> > Generating
> > > > > query
> > > > > > > > plans
> > > > > > > > > > that
> > > > > > > > > > > allow states to be generated which can be executed
> > > > > independently
> > > > > > > and
> > > > > > > > > then
> > > > > > > > > > > states recombined?
> > > > > > > > > > > On 16 Jun 2015 05:25, "Jihoon Son" <
> jihoonson@apache.org
> > >
> > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Hi Atri, thanks for your question.
> > > > > > > > > > > >
> > > > > > > > > > > > First of all, maybe you already did, I recommend that
> > you
> > > > > read
> > > > > > > this
> > > > > > > > > > > article
> > > > > > > > > > > > <
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> http://www.hadoopsphere.com/2015/02/technical-deep-dive-into-apache-tajo.html
> > > > > > > > > > > > >
> > > > > > > > > > > > before you start implementation. This is written by
> > > > Hyunsik,
> > > > > > and
> > > > > > > > > > contains
> > > > > > > > > > > > the description of Tajo's overall infrastructure.
> > > > > Afterwards, I
> > > > > > > > think
> > > > > > > > > > > that
> > > > > > > > > > > > you may ask more detailed question.
> > > > > > > > > > > >
> > > > > > > > > > > > Here, I'll roughly list some important classes for
> > > > aggregate
> > > > > > > > > > > > implementation.
> > > > > > > > > > > >
> > > > > > > > > > > >    - SQLParser.g4 contains our SQL parsing rules. It
> is
> > > > > written
> > > > > > > in
> > > > > > > > > > antlr.
> > > > > > > > > > > >    - SQLAnalyzer is our parser based on rules defined
> > at
> > > > > > > > > SQLParser.g4.
> > > > > > > > > > > >    - SQLAnalyzer translates a SQL query into a tree
> of
> > > Expr
> > > > > > which
> > > > > > > > > > > >    represents an algebraic expression.
> > > > > > > > > > > >    - LogicalPlanner translates the Expr tree into a
> > > > > LogicalPlan
> > > > > > > > that
> > > > > > > > > > > >    logically describes how the given query will be
> > > > executed.
> > > > > > > > > > > >    - GlobalPlanner translates the LogicalPlan into a
> > > > > MasterPlan
> > > > > > > > > > > >    (distributed query execution plan) that describes
> > how
> > > > the
> > > > > > > given
> > > > > > > > > > query
> > > > > > > > > > > > will
> > > > > > > > > > > >    be executed in distributed cluster.
> > > > > > > > > > > >    - Once a MasterPlan is created, QueryMaster starts
> > to
> > > > > > execute
> > > > > > > > > query
> > > > > > > > > > > >    processing. A query consists of multiple stages,
> > which
> > > > are
> > > > > > > > > > > individually
> > > > > > > > > > > >    processed in some order.
> > > > > > > > > > > >       - For example, a simple aggregation query is
> > > executed
> > > > > in
> > > > > > > two
> > > > > > > > > > > stages,
> > > > > > > > > > > >       each of which is for parallel aggregation and
> > > > combining
> > > > > > > > > > aggregates.
> > > > > > > > > > > > These
> > > > > > > > > > > >       stages are executed sequentially.
> > > > > > > > > > > >    - A stage is concurrently processed by multiple
> > tasks,
> > > > and
> > > > > > is
> > > > > > > > > > executed
> > > > > > > > > > > >    by TajoWorker.
> > > > > > > > > > > >    - Each task contains meta information for input
> data
> > > > and a
> > > > > > > > > > LogicalPlan
> > > > > > > > > > > >    of the stage. This LogicalPlan is translated into
> > > > > > PhysicalExec
> > > > > > > > by
> > > > > > > > > > > >    PhysicalPlanner.
> > > > > > > > > > > >    - PhysicalExec describes how the query is actually
> > > > > executed.
> > > > > > > > > > > >       - For example, there are two types of
> > > > AggregationExec,
> > > > > > > > > > > >       i.e., HashAggregateExec and SortAggregateExec,
> > for
> > > > > > > hash-based
> > > > > > > > > > > > aggregation
> > > > > > > > > > > >       and sort-based aggregation, respectively.
> > > > > > > > > > > >
> > > > > > > > > > > > Best regards,
> > > > > > > > > > > > Jihoon
> > > > > > > > > > > >
> > > > > > > > > > > > 2015년 6월 15일 (월) 오후 11:32, Atri Sharma <
> > > > atri.jiit@gmail.com
> > > > > >님이
> > > > > > > 작성:
> > > > > > > > > > > >
> > > > > > > > > > > > > Folks,
> > > > > > > > > > > > >
> > > > > > > > > > > > > I am looking into parallel aggregates/combining
> > > > > aggregates. I
> > > > > > > > have
> > > > > > > > > a
> > > > > > > > > > > plan
> > > > > > > > > > > > > around it which I think can work.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Please update me on current infrastructure and
> point
> > me
> > > > > > around
> > > > > > > > the
> > > > > > > > > > > > existing
> > > > > > > > > > > > > code base. Also, ideas would be most welcome around
> > it.
> > > > > > > > > > > > >
> > > > > > > > > > > > > --
> > > > > > > > > > > > > Regards,
> > > > > > > > > > > > >
> > > > > > > > > > > > > Atri
> > > > > > > > > > > > > *l'apprenant*
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: Parallel Aggregates

Posted by Jaehwa Jung <bl...@apache.org>.

Hi Atri

I think that following articles would be helpful for you to understand tajo
architecture.

- Tajo query execution and scheduling sequence : http://jaso.co.kr/501
- Tajo stage flow and StorageManager: http://jaso.co.kr/503

For the reference, above articles had been written by Hyoungjun Kim who is
Tajo PMC. :)

Cheers
Jaehwa

2015-06-18 20:21 GMT+09:00 Jihoon Son <ji...@apache.org>:

> It seems that there aren't ongoing issues for distinct aggregation.
>
> 2015년 6월 18일 (목) 오전 10:50, Atri Sharma <at...@gmail.com>님이 작성:
>
> > And for DISTINCT issues?
> > On 18 Jun 2015 15:16, "Jihoon Son" <gh...@gmail.com> wrote:
> >
> > > As far as I know, TAJO-256, TAJO-259, and TAJO-1420 are issues for data
> > > cube and grouping sets.
> > > You can create any issues if you want. Even though some issues can be
> > > duplicated, it's ok.
> > > 2015년 6월 18일 (목) 오전 10:39, Atri Sharma <at...@gmail.com>님이 작성:
> > >
> > > > Do we have a ticket around that?
> > > > On 18 Jun 2015 15:07, "Jihoon Son" <ji...@apache.org> wrote:
> > > >
> > > > > It looks good to start.
> > > > > Any questions welcome!
> > > > >
> > > > > Jihoon
> > > > >
> > > > > 2015년 6월 18일 (목) 오전 3:39, Atri Sharma <at...@gmail.com>님이 작성:
> > > > >
> > > > > > So distinct aggregation is one area, thanks.
> > > > > >
> > > > > > I am trying to get enough knowledge of Internals of aggregation
> > > engine
> > > > > and
> > > > > > query planner to be able to work on rollup and cube so picking
> > > smaller
> > > > > > tickets first.
> > > > > > On 18 Jun 2015 02:42, "Jihoon Son" <ji...@apache.org> wrote:
> > > > > >
> > > > > > > As far as I know, there aren't any plans for improvement except
> > in
> > > > > > distinct
> > > > > > > aggregation. I think that our code for distinct aggregation is
> > too
> > > > > > > complicated, and the performance also should be improved.
> > > > > > >
> > > > > > > So, when you design the implementation of your algorithm on
> Tajo,
> > > you
> > > > > > don't
> > > > > > > have to consider distinct aggregation part, I think.
> > > > > > >
> > > > > > > 2015년 6월 18일 (목) 오전 2:16, Atri Sharma <at...@gmail.com>님이
> > 작성:
> > > > > > >
> > > > > > > > Thank you.
> > > > > > > >
> > > > > > > > Is there any improvement in aggregates that we are looking at
> > > > please?
> > > > > > > > On 16 Jun 2015 17:07, "Jihoon Son" <ji...@apache.org>
> > wrote:
> > > > > > > >
> > > > > > > > > In Tajo, aggregation is very similar to that in Hadoop
> > > MapReduce.
> > > > > > > > > Let me consider an example. Given a query of "select *k*,
> > > > count(*)
> > > > > > from
> > > > > > > > *t*
> > > > > > > > > group by *k*", Tajo generates a LogicalPlan as follows.
> > > > > > > > >
> > > > > > > > > group by (k)
> > > > > > > > >        |
> > > > > > > > >    scan (t)
> > > > > > > > >
> > > > > > > > > This LogicalPlan is translated into a MasterPlan as
> follows.
> > > > > > > > >
> > > > > > > > > -----------------
> > > > > > > > >      Stage2
> > > > > > > > >   group by *k*
> > > > > > > > > -----------------
> > > > > > > > >           |
> > > > > > > > > shuffle tuples with *k*
> > > > > > > > >           |
> > > > > > > > > -----------------
> > > > > > > > >      Stage1
> > > > > > > > >   group by *k*
> > > > > > > > >          |
> > > > > > > > >     scan *t*
> > > > > > > > > -----------------
> > > > > > > > >
> > > > > > > > > As you can see in this example, the query plan consists of
> 2
> > > > > stages.
> > > > > > > Each
> > > > > > > > > stage is executed subsequently because the result of Stage
> 1
> > is
> > > > > used
> > > > > > as
> > > > > > > > the
> > > > > > > > > input of Stage 2. Each stage is divided into multiple tasks
> > for
> > > > > each
> > > > > > > > input
> > > > > > > > > split as follows.
> > > > > > > > >
> > > > > > > > > Stage1
> > > > > > > > >
> > > > > > > > > Task 1
> > > > > > > > > group by *k*
> > > > > > > > >        |
> > > > > > > > >   scan *t* (0 - 99)
> > > > > > > > >
> > > > > > > > > Task 2
> > > > > > > > > group by *k*
> > > > > > > > >        |
> > > > > > > > >   scan *t* (100 - 199)
> > > > > > > > > ...
> > > > > > > > >
> > > > > > > > > Each task is executed by a TajoWorker. As you can see,
> tasks
> > of
> > > > the
> > > > > > > first
> > > > > > > > > stage execute a local aggregation after scanning input
> split.
> > > > This
> > > > > > > local
> > > > > > > > > aggregation result is shuffled among TajoWorkers with the
> > > > > aggregation
> > > > > > > key
> > > > > > > > > *k*. Then, the final aggregation is computed at the second
> > > stage.
> > > > > > > > >
> > > > > > > > > Stage1 and Stage2 are similar to Map and Reduce of
> MapReduce.
> > > The
> > > > > > local
> > > > > > > > > aggregation of Stage1 is similar to the Combiner of Hadoop
> > > > > MapReduce.
> > > > > > > > >
> > > > > > > > > I hope that this will be helpful to you.
> > > > > > > > > If you have any further questions, please feel free to ask.
> > > > > > > > > Jihoon
> > > > > > > > >
> > > > > > > > > 2015년 6월 16일 (화) 오전 7:28, Atri Sharma <atri.jiit@gmail.com
> > >님이
> > > > 작성:
> > > > > > > > >
> > > > > > > > > Thanks.
> > > > > > > > > >
> > > > > > > > > > What are your thoughts on parallel aggregation?
> Generating
> > > > query
> > > > > > > plans
> > > > > > > > > that
> > > > > > > > > > allow states to be generated which can be executed
> > > > independently
> > > > > > and
> > > > > > > > then
> > > > > > > > > > states recombined?
> > > > > > > > > > On 16 Jun 2015 05:25, "Jihoon Son" <jihoonson@apache.org
> >
> > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi Atri, thanks for your question.
> > > > > > > > > > >
> > > > > > > > > > > First of all, maybe you already did, I recommend that
> you
> > > > read
> > > > > > this
> > > > > > > > > > article
> > > > > > > > > > > <
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> http://www.hadoopsphere.com/2015/02/technical-deep-dive-into-apache-tajo.html
> > > > > > > > > > > >
> > > > > > > > > > > before you start implementation. This is written by
> > > Hyunsik,
> > > > > and
> > > > > > > > > contains
> > > > > > > > > > > the description of Tajo's overall infrastructure.
> > > > Afterwards, I
> > > > > > > think
> > > > > > > > > > that
> > > > > > > > > > > you may ask more detailed question.
> > > > > > > > > > >
> > > > > > > > > > > Here, I'll roughly list some important classes for
> > > aggregate
> > > > > > > > > > > implementation.
> > > > > > > > > > >
> > > > > > > > > > >    - SQLParser.g4 contains our SQL parsing rules. It is
> > > > written
> > > > > > in
> > > > > > > > > antlr.
> > > > > > > > > > >    - SQLAnalyzer is our parser based on rules defined
> at
> > > > > > > > SQLParser.g4.
> > > > > > > > > > >    - SQLAnalyzer translates a SQL query into a tree of
> > Expr
> > > > > which
> > > > > > > > > > >    represents an algebraic expression.
> > > > > > > > > > >    - LogicalPlanner translates the Expr tree into a
> > > > LogicalPlan
> > > > > > > that
> > > > > > > > > > >    logically describes how the given query will be
> > > executed.
> > > > > > > > > > >    - GlobalPlanner translates the LogicalPlan into a
> > > > MasterPlan
> > > > > > > > > > >    (distributed query execution plan) that describes
> how
> > > the
> > > > > > given
> > > > > > > > > query
> > > > > > > > > > > will
> > > > > > > > > > >    be executed in distributed cluster.
> > > > > > > > > > >    - Once a MasterPlan is created, QueryMaster starts
> to
> > > > > execute
> > > > > > > > query
> > > > > > > > > > >    processing. A query consists of multiple stages,
> which
> > > are
> > > > > > > > > > individually
> > > > > > > > > > >    processed in some order.
> > > > > > > > > > >       - For example, a simple aggregation query is
> > executed
> > > > in
> > > > > > two
> > > > > > > > > > stages,
> > > > > > > > > > >       each of which is for parallel aggregation and
> > > combining
> > > > > > > > > aggregates.
> > > > > > > > > > > These
> > > > > > > > > > >       stages are executed sequentially.
> > > > > > > > > > >    - A stage is concurrently processed by multiple
> tasks,
> > > and
> > > > > is
> > > > > > > > > executed
> > > > > > > > > > >    by TajoWorker.
> > > > > > > > > > >    - Each task contains meta information for input data
> > > and a
> > > > > > > > > LogicalPlan
> > > > > > > > > > >    of the stage. This LogicalPlan is translated into
> > > > > PhysicalExec
> > > > > > > by
> > > > > > > > > > >    PhysicalPlanner.
> > > > > > > > > > >    - PhysicalExec describes how the query is actually
> > > > executed.
> > > > > > > > > > >       - For example, there are two types of
> > > AggregationExec,
> > > > > > > > > > >       i.e., HashAggregateExec and SortAggregateExec,
> for
> > > > > > hash-based
> > > > > > > > > > > aggregation
> > > > > > > > > > >       and sort-based aggregation, respectively.
> > > > > > > > > > >
> > > > > > > > > > > Best regards,
> > > > > > > > > > > Jihoon
> > > > > > > > > > >
> > > > > > > > > > > 2015년 6월 15일 (월) 오후 11:32, Atri Sharma <
> > > atri.jiit@gmail.com
> > > > >님이
> > > > > > 작성:
> > > > > > > > > > >
> > > > > > > > > > > > Folks,
> > > > > > > > > > > >
> > > > > > > > > > > > I am looking into parallel aggregates/combining
> > > > aggregates. I
> > > > > > > have
> > > > > > > > a
> > > > > > > > > > plan
> > > > > > > > > > > > around it which I think can work.
> > > > > > > > > > > >
> > > > > > > > > > > > Please update me on current infrastructure and point
> me
> > > > > around
> > > > > > > the
> > > > > > > > > > > existing
> > > > > > > > > > > > code base. Also, ideas would be most welcome around
> it.
> > > > > > > > > > > >
> > > > > > > > > > > > --
> > > > > > > > > > > > Regards,
> > > > > > > > > > > >
> > > > > > > > > > > > Atri
> > > > > > > > > > > > *l'apprenant*
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: Parallel Aggregates

Posted by Jihoon Son <ji...@apache.org>.

It seems that there aren't ongoing issues for distinct aggregation.

2015년 6월 18일 (목) 오전 10:50, Atri Sharma <at...@gmail.com>님이 작성:

> And for DISTINCT issues?
> On 18 Jun 2015 15:16, "Jihoon Son" <gh...@gmail.com> wrote:
>
> > As far as I know, TAJO-256, TAJO-259, and TAJO-1420 are issues for data
> > cube and grouping sets.
> > You can create any issues if you want. Even though some issues can be
> > duplicated, it's ok.
> > 2015년 6월 18일 (목) 오전 10:39, Atri Sharma <at...@gmail.com>님이 작성:
> >
> > > Do we have a ticket around that?
> > > On 18 Jun 2015 15:07, "Jihoon Son" <ji...@apache.org> wrote:
> > >
> > > > It looks good to start.
> > > > Any questions welcome!
> > > >
> > > > Jihoon
> > > >
> > > > 2015년 6월 18일 (목) 오전 3:39, Atri Sharma <at...@gmail.com>님이 작성:
> > > >
> > > > > So distinct aggregation is one area, thanks.
> > > > >
> > > > > I am trying to get enough knowledge of Internals of aggregation
> > engine
> > > > and
> > > > > query planner to be able to work on rollup and cube so picking
> > smaller
> > > > > tickets first.
> > > > > On 18 Jun 2015 02:42, "Jihoon Son" <ji...@apache.org> wrote:
> > > > >
> > > > > > As far as I know, there aren't any plans for improvement except
> in
> > > > > distinct
> > > > > > aggregation. I think that our code for distinct aggregation is
> too
> > > > > > complicated, and the performance also should be improved.
> > > > > >
> > > > > > So, when you design the implementation of your algorithm on Tajo,
> > you
> > > > > don't
> > > > > > have to consider distinct aggregation part, I think.
> > > > > >
> > > > > > 2015년 6월 18일 (목) 오전 2:16, Atri Sharma <at...@gmail.com>님이
> 작성:
> > > > > >
> > > > > > > Thank you.
> > > > > > >
> > > > > > > Is there any improvement in aggregates that we are looking at
> > > please?
> > > > > > > On 16 Jun 2015 17:07, "Jihoon Son" <ji...@apache.org>
> wrote:
> > > > > > >
> > > > > > > > In Tajo, aggregation is very similar to that in Hadoop
> > MapReduce.
> > > > > > > > Let me consider an example. Given a query of "select *k*,
> > > count(*)
> > > > > from
> > > > > > > *t*
> > > > > > > > group by *k*", Tajo generates a LogicalPlan as follows.
> > > > > > > >
> > > > > > > > group by (k)
> > > > > > > >        |
> > > > > > > >    scan (t)
> > > > > > > >
> > > > > > > > This LogicalPlan is translated into a MasterPlan as follows.
> > > > > > > >
> > > > > > > > -----------------
> > > > > > > >      Stage2
> > > > > > > >   group by *k*
> > > > > > > > -----------------
> > > > > > > >           |
> > > > > > > > shuffle tuples with *k*
> > > > > > > >           |
> > > > > > > > -----------------
> > > > > > > >      Stage1
> > > > > > > >   group by *k*
> > > > > > > >          |
> > > > > > > >     scan *t*
> > > > > > > > -----------------
> > > > > > > >
> > > > > > > > As you can see in this example, the query plan consists of 2
> > > > stages.
> > > > > > Each
> > > > > > > > stage is executed subsequently because the result of Stage 1
> is
> > > > used
> > > > > as
> > > > > > > the
> > > > > > > > input of Stage 2. Each stage is divided into multiple tasks
> for
> > > > each
> > > > > > > input
> > > > > > > > split as follows.
> > > > > > > >
> > > > > > > > Stage1
> > > > > > > >
> > > > > > > > Task 1
> > > > > > > > group by *k*
> > > > > > > >        |
> > > > > > > >   scan *t* (0 - 99)
> > > > > > > >
> > > > > > > > Task 2
> > > > > > > > group by *k*
> > > > > > > >        |
> > > > > > > >   scan *t* (100 - 199)
> > > > > > > > ...
> > > > > > > >
> > > > > > > > Each task is executed by a TajoWorker. As you can see, tasks
> of
> > > the
> > > > > > first
> > > > > > > > stage execute a local aggregation after scanning input split.
> > > This
> > > > > > local
> > > > > > > > aggregation result is shuffled among TajoWorkers with the
> > > > aggregation
> > > > > > key
> > > > > > > > *k*. Then, the final aggregation is computed at the second
> > stage.
> > > > > > > >
> > > > > > > > Stage1 and Stage2 are similar to Map and Reduce of MapReduce.
> > The
> > > > > local
> > > > > > > > aggregation of Stage1 is similar to the Combiner of Hadoop
> > > > MapReduce.
> > > > > > > >
> > > > > > > > I hope that this will be helpful to you.
> > > > > > > > If you have any further questions, please feel free to ask.
> > > > > > > > Jihoon
> > > > > > > >
> > > > > > > > 2015년 6월 16일 (화) 오전 7:28, Atri Sharma <atri.jiit@gmail.com
> >님이
> > > 작성:
> > > > > > > >
> > > > > > > > Thanks.
> > > > > > > > >
> > > > > > > > > What are your thoughts on parallel aggregation? Generating
> > > query
> > > > > > plans
> > > > > > > > that
> > > > > > > > > allow states to be generated which can be executed
> > > independently
> > > > > and
> > > > > > > then
> > > > > > > > > states recombined?
> > > > > > > > > On 16 Jun 2015 05:25, "Jihoon Son" <ji...@apache.org>
> > > wrote:
> > > > > > > > >
> > > > > > > > > > Hi Atri, thanks for your question.
> > > > > > > > > >
> > > > > > > > > > First of all, maybe you already did, I recommend that you
> > > read
> > > > > this
> > > > > > > > > article
> > > > > > > > > > <
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> http://www.hadoopsphere.com/2015/02/technical-deep-dive-into-apache-tajo.html
> > > > > > > > > > >
> > > > > > > > > > before you start implementation. This is written by
> > Hyunsik,
> > > > and
> > > > > > > > contains
> > > > > > > > > > the description of Tajo's overall infrastructure.
> > > Afterwards, I
> > > > > > think
> > > > > > > > > that
> > > > > > > > > > you may ask more detailed question.
> > > > > > > > > >
> > > > > > > > > > Here, I'll roughly list some important classes for
> > aggregate
> > > > > > > > > > implementation.
> > > > > > > > > >
> > > > > > > > > >    - SQLParser.g4 contains our SQL parsing rules. It is
> > > written
> > > > > in
> > > > > > > > antlr.
> > > > > > > > > >    - SQLAnalyzer is our parser based on rules defined at
> > > > > > > SQLParser.g4.
> > > > > > > > > >    - SQLAnalyzer translates a SQL query into a tree of
> Expr
> > > > which
> > > > > > > > > >    represents an algebraic expression.
> > > > > > > > > >    - LogicalPlanner translates the Expr tree into a
> > > LogicalPlan
> > > > > > that
> > > > > > > > > >    logically describes how the given query will be
> > executed.
> > > > > > > > > >    - GlobalPlanner translates the LogicalPlan into a
> > > MasterPlan
> > > > > > > > > >    (distributed query execution plan) that describes how
> > the
> > > > > given
> > > > > > > > query
> > > > > > > > > > will
> > > > > > > > > >    be executed in distributed cluster.
> > > > > > > > > >    - Once a MasterPlan is created, QueryMaster starts to
> > > > execute
> > > > > > > query
> > > > > > > > > >    processing. A query consists of multiple stages, which
> > are
> > > > > > > > > individually
> > > > > > > > > >    processed in some order.
> > > > > > > > > >       - For example, a simple aggregation query is
> executed
> > > in
> > > > > two
> > > > > > > > > stages,
> > > > > > > > > >       each of which is for parallel aggregation and
> > combining
> > > > > > > > aggregates.
> > > > > > > > > > These
> > > > > > > > > >       stages are executed sequentially.
> > > > > > > > > >    - A stage is concurrently processed by multiple tasks,
> > and
> > > > is
> > > > > > > > executed
> > > > > > > > > >    by TajoWorker.
> > > > > > > > > >    - Each task contains meta information for input data
> > and a
> > > > > > > > LogicalPlan
> > > > > > > > > >    of the stage. This LogicalPlan is translated into
> > > > PhysicalExec
> > > > > > by
> > > > > > > > > >    PhysicalPlanner.
> > > > > > > > > >    - PhysicalExec describes how the query is actually
> > > executed.
> > > > > > > > > >       - For example, there are two types of
> > AggregationExec,
> > > > > > > > > >       i.e., HashAggregateExec and SortAggregateExec, for
> > > > > hash-based
> > > > > > > > > > aggregation
> > > > > > > > > >       and sort-based aggregation, respectively.
> > > > > > > > > >
> > > > > > > > > > Best regards,
> > > > > > > > > > Jihoon
> > > > > > > > > >
> > > > > > > > > > 2015년 6월 15일 (월) 오후 11:32, Atri Sharma <
> > atri.jiit@gmail.com
> > > >님이
> > > > > 작성:
> > > > > > > > > >
> > > > > > > > > > > Folks,
> > > > > > > > > > >
> > > > > > > > > > > I am looking into parallel aggregates/combining
> > > aggregates. I
> > > > > > have
> > > > > > > a
> > > > > > > > > plan
> > > > > > > > > > > around it which I think can work.
> > > > > > > > > > >
> > > > > > > > > > > Please update me on current infrastructure and point me
> > > > around
> > > > > > the
> > > > > > > > > > existing
> > > > > > > > > > > code base. Also, ideas would be most welcome around it.
> > > > > > > > > > >
> > > > > > > > > > > --
> > > > > > > > > > > Regards,
> > > > > > > > > > >
> > > > > > > > > > > Atri
> > > > > > > > > > > *l'apprenant*
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: Parallel Aggregates

Posted by Atri Sharma <at...@gmail.com>.

And for DISTINCT issues?
On 18 Jun 2015 15:16, "Jihoon Son" <gh...@gmail.com> wrote:

> As far as I know, TAJO-256, TAJO-259, and TAJO-1420 are issues for data
> cube and grouping sets.
> You can create any issues if you want. Even though some issues can be
> duplicated, it's ok.
> 2015년 6월 18일 (목) 오전 10:39, Atri Sharma <at...@gmail.com>님이 작성:
>
> > Do we have a ticket around that?
> > On 18 Jun 2015 15:07, "Jihoon Son" <ji...@apache.org> wrote:
> >
> > > It looks good to start.
> > > Any questions welcome!
> > >
> > > Jihoon
> > >
> > > 2015년 6월 18일 (목) 오전 3:39, Atri Sharma <at...@gmail.com>님이 작성:
> > >
> > > > So distinct aggregation is one area, thanks.
> > > >
> > > > I am trying to get enough knowledge of Internals of aggregation
> engine
> > > and
> > > > query planner to be able to work on rollup and cube so picking
> smaller
> > > > tickets first.
> > > > On 18 Jun 2015 02:42, "Jihoon Son" <ji...@apache.org> wrote:
> > > >
> > > > > As far as I know, there aren't any plans for improvement except in
> > > > distinct
> > > > > aggregation. I think that our code for distinct aggregation is too
> > > > > complicated, and the performance also should be improved.
> > > > >
> > > > > So, when you design the implementation of your algorithm on Tajo,
> you
> > > > don't
> > > > > have to consider distinct aggregation part, I think.
> > > > >
> > > > > 2015년 6월 18일 (목) 오전 2:16, Atri Sharma <at...@gmail.com>님이 작성:
> > > > >
> > > > > > Thank you.
> > > > > >
> > > > > > Is there any improvement in aggregates that we are looking at
> > please?
> > > > > > On 16 Jun 2015 17:07, "Jihoon Son" <ji...@apache.org> wrote:
> > > > > >
> > > > > > > In Tajo, aggregation is very similar to that in Hadoop
> MapReduce.
> > > > > > > Let me consider an example. Given a query of "select *k*,
> > count(*)
> > > > from
> > > > > > *t*
> > > > > > > group by *k*", Tajo generates a LogicalPlan as follows.
> > > > > > >
> > > > > > > group by (k)
> > > > > > >        |
> > > > > > >    scan (t)
> > > > > > >
> > > > > > > This LogicalPlan is translated into a MasterPlan as follows.
> > > > > > >
> > > > > > > -----------------
> > > > > > >      Stage2
> > > > > > >   group by *k*
> > > > > > > -----------------
> > > > > > >           |
> > > > > > > shuffle tuples with *k*
> > > > > > >           |
> > > > > > > -----------------
> > > > > > >      Stage1
> > > > > > >   group by *k*
> > > > > > >          |
> > > > > > >     scan *t*
> > > > > > > -----------------
> > > > > > >
> > > > > > > As you can see in this example, the query plan consists of 2
> > > stages.
> > > > > Each
> > > > > > > stage is executed subsequently because the result of Stage 1 is
> > > used
> > > > as
> > > > > > the
> > > > > > > input of Stage 2. Each stage is divided into multiple tasks for
> > > each
> > > > > > input
> > > > > > > split as follows.
> > > > > > >
> > > > > > > Stage1
> > > > > > >
> > > > > > > Task 1
> > > > > > > group by *k*
> > > > > > >        |
> > > > > > >   scan *t* (0 - 99)
> > > > > > >
> > > > > > > Task 2
> > > > > > > group by *k*
> > > > > > >        |
> > > > > > >   scan *t* (100 - 199)
> > > > > > > ...
> > > > > > >
> > > > > > > Each task is executed by a TajoWorker. As you can see, tasks of
> > the
> > > > > first
> > > > > > > stage execute a local aggregation after scanning input split.
> > This
> > > > > local
> > > > > > > aggregation result is shuffled among TajoWorkers with the
> > > aggregation
> > > > > key
> > > > > > > *k*. Then, the final aggregation is computed at the second
> stage.
> > > > > > >
> > > > > > > Stage1 and Stage2 are similar to Map and Reduce of MapReduce.
> The
> > > > local
> > > > > > > aggregation of Stage1 is similar to the Combiner of Hadoop
> > > MapReduce.
> > > > > > >
> > > > > > > I hope that this will be helpful to you.
> > > > > > > If you have any further questions, please feel free to ask.
> > > > > > > Jihoon
> > > > > > >
> > > > > > > 2015년 6월 16일 (화) 오전 7:28, Atri Sharma <at...@gmail.com>님이
> > 작성:
> > > > > > >
> > > > > > > Thanks.
> > > > > > > >
> > > > > > > > What are your thoughts on parallel aggregation? Generating
> > query
> > > > > plans
> > > > > > > that
> > > > > > > > allow states to be generated which can be executed
> > independently
> > > > and
> > > > > > then
> > > > > > > > states recombined?
> > > > > > > > On 16 Jun 2015 05:25, "Jihoon Son" <ji...@apache.org>
> > wrote:
> > > > > > > >
> > > > > > > > > Hi Atri, thanks for your question.
> > > > > > > > >
> > > > > > > > > First of all, maybe you already did, I recommend that you
> > read
> > > > this
> > > > > > > > article
> > > > > > > > > <
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> http://www.hadoopsphere.com/2015/02/technical-deep-dive-into-apache-tajo.html
> > > > > > > > > >
> > > > > > > > > before you start implementation. This is written by
> Hyunsik,
> > > and
> > > > > > > contains
> > > > > > > > > the description of Tajo's overall infrastructure.
> > Afterwards, I
> > > > > think
> > > > > > > > that
> > > > > > > > > you may ask more detailed question.
> > > > > > > > >
> > > > > > > > > Here, I'll roughly list some important classes for
> aggregate
> > > > > > > > > implementation.
> > > > > > > > >
> > > > > > > > >    - SQLParser.g4 contains our SQL parsing rules. It is
> > written
> > > > in
> > > > > > > antlr.
> > > > > > > > >    - SQLAnalyzer is our parser based on rules defined at
> > > > > > SQLParser.g4.
> > > > > > > > >    - SQLAnalyzer translates a SQL query into a tree of Expr
> > > which
> > > > > > > > >    represents an algebraic expression.
> > > > > > > > >    - LogicalPlanner translates the Expr tree into a
> > LogicalPlan
> > > > > that
> > > > > > > > >    logically describes how the given query will be
> executed.
> > > > > > > > >    - GlobalPlanner translates the LogicalPlan into a
> > MasterPlan
> > > > > > > > >    (distributed query execution plan) that describes how
> the
> > > > given
> > > > > > > query
> > > > > > > > > will
> > > > > > > > >    be executed in distributed cluster.
> > > > > > > > >    - Once a MasterPlan is created, QueryMaster starts to
> > > execute
> > > > > > query
> > > > > > > > >    processing. A query consists of multiple stages, which
> are
> > > > > > > > individually
> > > > > > > > >    processed in some order.
> > > > > > > > >       - For example, a simple aggregation query is executed
> > in
> > > > two
> > > > > > > > stages,
> > > > > > > > >       each of which is for parallel aggregation and
> combining
> > > > > > > aggregates.
> > > > > > > > > These
> > > > > > > > >       stages are executed sequentially.
> > > > > > > > >    - A stage is concurrently processed by multiple tasks,
> and
> > > is
> > > > > > > executed
> > > > > > > > >    by TajoWorker.
> > > > > > > > >    - Each task contains meta information for input data
> and a
> > > > > > > LogicalPlan
> > > > > > > > >    of the stage. This LogicalPlan is translated into
> > > PhysicalExec
> > > > > by
> > > > > > > > >    PhysicalPlanner.
> > > > > > > > >    - PhysicalExec describes how the query is actually
> > executed.
> > > > > > > > >       - For example, there are two types of
> AggregationExec,
> > > > > > > > >       i.e., HashAggregateExec and SortAggregateExec, for
> > > > hash-based
> > > > > > > > > aggregation
> > > > > > > > >       and sort-based aggregation, respectively.
> > > > > > > > >
> > > > > > > > > Best regards,
> > > > > > > > > Jihoon
> > > > > > > > >
> > > > > > > > > 2015년 6월 15일 (월) 오후 11:32, Atri Sharma <
> atri.jiit@gmail.com
> > >님이
> > > > 작성:
> > > > > > > > >
> > > > > > > > > > Folks,
> > > > > > > > > >
> > > > > > > > > > I am looking into parallel aggregates/combining
> > aggregates. I
> > > > > have
> > > > > > a
> > > > > > > > plan
> > > > > > > > > > around it which I think can work.
> > > > > > > > > >
> > > > > > > > > > Please update me on current infrastructure and point me
> > > around
> > > > > the
> > > > > > > > > existing
> > > > > > > > > > code base. Also, ideas would be most welcome around it.
> > > > > > > > > >
> > > > > > > > > > --
> > > > > > > > > > Regards,
> > > > > > > > > >
> > > > > > > > > > Atri
> > > > > > > > > > *l'apprenant*
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: Parallel Aggregates

Posted by Jihoon Son <gh...@gmail.com>.

As far as I know, TAJO-256, TAJO-259, and TAJO-1420 are issues for data
cube and grouping sets.
You can create any issues if you want. Even though some issues can be
duplicated, it's ok.
2015년 6월 18일 (목) 오전 10:39, Atri Sharma <at...@gmail.com>님이 작성:

> Do we have a ticket around that?
> On 18 Jun 2015 15:07, "Jihoon Son" <ji...@apache.org> wrote:
>
> > It looks good to start.
> > Any questions welcome!
> >
> > Jihoon
> >
> > 2015년 6월 18일 (목) 오전 3:39, Atri Sharma <at...@gmail.com>님이 작성:
> >
> > > So distinct aggregation is one area, thanks.
> > >
> > > I am trying to get enough knowledge of Internals of aggregation engine
> > and
> > > query planner to be able to work on rollup and cube so picking smaller
> > > tickets first.
> > > On 18 Jun 2015 02:42, "Jihoon Son" <ji...@apache.org> wrote:
> > >
> > > > As far as I know, there aren't any plans for improvement except in
> > > distinct
> > > > aggregation. I think that our code for distinct aggregation is too
> > > > complicated, and the performance also should be improved.
> > > >
> > > > So, when you design the implementation of your algorithm on Tajo, you
> > > don't
> > > > have to consider distinct aggregation part, I think.
> > > >
> > > > 2015년 6월 18일 (목) 오전 2:16, Atri Sharma <at...@gmail.com>님이 작성:
> > > >
> > > > > Thank you.
> > > > >
> > > > > Is there any improvement in aggregates that we are looking at
> please?
> > > > > On 16 Jun 2015 17:07, "Jihoon Son" <ji...@apache.org> wrote:
> > > > >
> > > > > > In Tajo, aggregation is very similar to that in Hadoop MapReduce.
> > > > > > Let me consider an example. Given a query of "select *k*,
> count(*)
> > > from
> > > > > *t*
> > > > > > group by *k*", Tajo generates a LogicalPlan as follows.
> > > > > >
> > > > > > group by (k)
> > > > > >        |
> > > > > >    scan (t)
> > > > > >
> > > > > > This LogicalPlan is translated into a MasterPlan as follows.
> > > > > >
> > > > > > -----------------
> > > > > >      Stage2
> > > > > >   group by *k*
> > > > > > -----------------
> > > > > >           |
> > > > > > shuffle tuples with *k*
> > > > > >           |
> > > > > > -----------------
> > > > > >      Stage1
> > > > > >   group by *k*
> > > > > >          |
> > > > > >     scan *t*
> > > > > > -----------------
> > > > > >
> > > > > > As you can see in this example, the query plan consists of 2
> > stages.
> > > > Each
> > > > > > stage is executed subsequently because the result of Stage 1 is
> > used
> > > as
> > > > > the
> > > > > > input of Stage 2. Each stage is divided into multiple tasks for
> > each
> > > > > input
> > > > > > split as follows.
> > > > > >
> > > > > > Stage1
> > > > > >
> > > > > > Task 1
> > > > > > group by *k*
> > > > > >        |
> > > > > >   scan *t* (0 - 99)
> > > > > >
> > > > > > Task 2
> > > > > > group by *k*
> > > > > >        |
> > > > > >   scan *t* (100 - 199)
> > > > > > ...
> > > > > >
> > > > > > Each task is executed by a TajoWorker. As you can see, tasks of
> the
> > > > first
> > > > > > stage execute a local aggregation after scanning input split.
> This
> > > > local
> > > > > > aggregation result is shuffled among TajoWorkers with the
> > aggregation
> > > > key
> > > > > > *k*. Then, the final aggregation is computed at the second stage.
> > > > > >
> > > > > > Stage1 and Stage2 are similar to Map and Reduce of MapReduce. The
> > > local
> > > > > > aggregation of Stage1 is similar to the Combiner of Hadoop
> > MapReduce.
> > > > > >
> > > > > > I hope that this will be helpful to you.
> > > > > > If you have any further questions, please feel free to ask.
> > > > > > Jihoon
> > > > > >
> > > > > > 2015년 6월 16일 (화) 오전 7:28, Atri Sharma <at...@gmail.com>님이
> 작성:
> > > > > >
> > > > > > Thanks.
> > > > > > >
> > > > > > > What are your thoughts on parallel aggregation? Generating
> query
> > > > plans
> > > > > > that
> > > > > > > allow states to be generated which can be executed
> independently
> > > and
> > > > > then
> > > > > > > states recombined?
> > > > > > > On 16 Jun 2015 05:25, "Jihoon Son" <ji...@apache.org>
> wrote:
> > > > > > >
> > > > > > > > Hi Atri, thanks for your question.
> > > > > > > >
> > > > > > > > First of all, maybe you already did, I recommend that you
> read
> > > this
> > > > > > > article
> > > > > > > > <
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> http://www.hadoopsphere.com/2015/02/technical-deep-dive-into-apache-tajo.html
> > > > > > > > >
> > > > > > > > before you start implementation. This is written by Hyunsik,
> > and
> > > > > > contains
> > > > > > > > the description of Tajo's overall infrastructure.
> Afterwards, I
> > > > think
> > > > > > > that
> > > > > > > > you may ask more detailed question.
> > > > > > > >
> > > > > > > > Here, I'll roughly list some important classes for aggregate
> > > > > > > > implementation.
> > > > > > > >
> > > > > > > >    - SQLParser.g4 contains our SQL parsing rules. It is
> written
> > > in
> > > > > > antlr.
> > > > > > > >    - SQLAnalyzer is our parser based on rules defined at
> > > > > SQLParser.g4.
> > > > > > > >    - SQLAnalyzer translates a SQL query into a tree of Expr
> > which
> > > > > > > >    represents an algebraic expression.
> > > > > > > >    - LogicalPlanner translates the Expr tree into a
> LogicalPlan
> > > > that
> > > > > > > >    logically describes how the given query will be executed.
> > > > > > > >    - GlobalPlanner translates the LogicalPlan into a
> MasterPlan
> > > > > > > >    (distributed query execution plan) that describes how the
> > > given
> > > > > > query
> > > > > > > > will
> > > > > > > >    be executed in distributed cluster.
> > > > > > > >    - Once a MasterPlan is created, QueryMaster starts to
> > execute
> > > > > query
> > > > > > > >    processing. A query consists of multiple stages, which are
> > > > > > > individually
> > > > > > > >    processed in some order.
> > > > > > > >       - For example, a simple aggregation query is executed
> in
> > > two
> > > > > > > stages,
> > > > > > > >       each of which is for parallel aggregation and combining
> > > > > > aggregates.
> > > > > > > > These
> > > > > > > >       stages are executed sequentially.
> > > > > > > >    - A stage is concurrently processed by multiple tasks, and
> > is
> > > > > > executed
> > > > > > > >    by TajoWorker.
> > > > > > > >    - Each task contains meta information for input data and a
> > > > > > LogicalPlan
> > > > > > > >    of the stage. This LogicalPlan is translated into
> > PhysicalExec
> > > > by
> > > > > > > >    PhysicalPlanner.
> > > > > > > >    - PhysicalExec describes how the query is actually
> executed.
> > > > > > > >       - For example, there are two types of AggregationExec,
> > > > > > > >       i.e., HashAggregateExec and SortAggregateExec, for
> > > hash-based
> > > > > > > > aggregation
> > > > > > > >       and sort-based aggregation, respectively.
> > > > > > > >
> > > > > > > > Best regards,
> > > > > > > > Jihoon
> > > > > > > >
> > > > > > > > 2015년 6월 15일 (월) 오후 11:32, Atri Sharma <atri.jiit@gmail.com
> >님이
> > > 작성:
> > > > > > > >
> > > > > > > > > Folks,
> > > > > > > > >
> > > > > > > > > I am looking into parallel aggregates/combining
> aggregates. I
> > > > have
> > > > > a
> > > > > > > plan
> > > > > > > > > around it which I think can work.
> > > > > > > > >
> > > > > > > > > Please update me on current infrastructure and point me
> > around
> > > > the
> > > > > > > > existing
> > > > > > > > > code base. Also, ideas would be most welcome around it.
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > Regards,
> > > > > > > > >
> > > > > > > > > Atri
> > > > > > > > > *l'apprenant*
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: Parallel Aggregates

Posted by Atri Sharma <at...@gmail.com>.

Do we have a ticket around that?
On 18 Jun 2015 15:07, "Jihoon Son" <ji...@apache.org> wrote:

> It looks good to start.
> Any questions welcome!
>
> Jihoon
>
> 2015년 6월 18일 (목) 오전 3:39, Atri Sharma <at...@gmail.com>님이 작성:
>
> > So distinct aggregation is one area, thanks.
> >
> > I am trying to get enough knowledge of Internals of aggregation engine
> and
> > query planner to be able to work on rollup and cube so picking smaller
> > tickets first.
> > On 18 Jun 2015 02:42, "Jihoon Son" <ji...@apache.org> wrote:
> >
> > > As far as I know, there aren't any plans for improvement except in
> > distinct
> > > aggregation. I think that our code for distinct aggregation is too
> > > complicated, and the performance also should be improved.
> > >
> > > So, when you design the implementation of your algorithm on Tajo, you
> > don't
> > > have to consider distinct aggregation part, I think.
> > >
> > > 2015년 6월 18일 (목) 오전 2:16, Atri Sharma <at...@gmail.com>님이 작성:
> > >
> > > > Thank you.
> > > >
> > > > Is there any improvement in aggregates that we are looking at please?
> > > > On 16 Jun 2015 17:07, "Jihoon Son" <ji...@apache.org> wrote:
> > > >
> > > > > In Tajo, aggregation is very similar to that in Hadoop MapReduce.
> > > > > Let me consider an example. Given a query of "select *k*, count(*)
> > from
> > > > *t*
> > > > > group by *k*", Tajo generates a LogicalPlan as follows.
> > > > >
> > > > > group by (k)
> > > > >        |
> > > > >    scan (t)
> > > > >
> > > > > This LogicalPlan is translated into a MasterPlan as follows.
> > > > >
> > > > > -----------------
> > > > >      Stage2
> > > > >   group by *k*
> > > > > -----------------
> > > > >           |
> > > > > shuffle tuples with *k*
> > > > >           |
> > > > > -----------------
> > > > >      Stage1
> > > > >   group by *k*
> > > > >          |
> > > > >     scan *t*
> > > > > -----------------
> > > > >
> > > > > As you can see in this example, the query plan consists of 2
> stages.
> > > Each
> > > > > stage is executed subsequently because the result of Stage 1 is
> used
> > as
> > > > the
> > > > > input of Stage 2. Each stage is divided into multiple tasks for
> each
> > > > input
> > > > > split as follows.
> > > > >
> > > > > Stage1
> > > > >
> > > > > Task 1
> > > > > group by *k*
> > > > >        |
> > > > >   scan *t* (0 - 99)
> > > > >
> > > > > Task 2
> > > > > group by *k*
> > > > >        |
> > > > >   scan *t* (100 - 199)
> > > > > ...
> > > > >
> > > > > Each task is executed by a TajoWorker. As you can see, tasks of the
> > > first
> > > > > stage execute a local aggregation after scanning input split. This
> > > local
> > > > > aggregation result is shuffled among TajoWorkers with the
> aggregation
> > > key
> > > > > *k*. Then, the final aggregation is computed at the second stage.
> > > > >
> > > > > Stage1 and Stage2 are similar to Map and Reduce of MapReduce. The
> > local
> > > > > aggregation of Stage1 is similar to the Combiner of Hadoop
> MapReduce.
> > > > >
> > > > > I hope that this will be helpful to you.
> > > > > If you have any further questions, please feel free to ask.
> > > > > Jihoon
> > > > >
> > > > > 2015년 6월 16일 (화) 오전 7:28, Atri Sharma <at...@gmail.com>님이 작성:
> > > > >
> > > > > Thanks.
> > > > > >
> > > > > > What are your thoughts on parallel aggregation? Generating query
> > > plans
> > > > > that
> > > > > > allow states to be generated which can be executed independently
> > and
> > > > then
> > > > > > states recombined?
> > > > > > On 16 Jun 2015 05:25, "Jihoon Son" <ji...@apache.org> wrote:
> > > > > >
> > > > > > > Hi Atri, thanks for your question.
> > > > > > >
> > > > > > > First of all, maybe you already did, I recommend that you read
> > this
> > > > > > article
> > > > > > > <
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> http://www.hadoopsphere.com/2015/02/technical-deep-dive-into-apache-tajo.html
> > > > > > > >
> > > > > > > before you start implementation. This is written by Hyunsik,
> and
> > > > > contains
> > > > > > > the description of Tajo's overall infrastructure. Afterwards, I
> > > think
> > > > > > that
> > > > > > > you may ask more detailed question.
> > > > > > >
> > > > > > > Here, I'll roughly list some important classes for aggregate
> > > > > > > implementation.
> > > > > > >
> > > > > > >    - SQLParser.g4 contains our SQL parsing rules. It is written
> > in
> > > > > antlr.
> > > > > > >    - SQLAnalyzer is our parser based on rules defined at
> > > > SQLParser.g4.
> > > > > > >    - SQLAnalyzer translates a SQL query into a tree of Expr
> which
> > > > > > >    represents an algebraic expression.
> > > > > > >    - LogicalPlanner translates the Expr tree into a LogicalPlan
> > > that
> > > > > > >    logically describes how the given query will be executed.
> > > > > > >    - GlobalPlanner translates the LogicalPlan into a MasterPlan
> > > > > > >    (distributed query execution plan) that describes how the
> > given
> > > > > query
> > > > > > > will
> > > > > > >    be executed in distributed cluster.
> > > > > > >    - Once a MasterPlan is created, QueryMaster starts to
> execute
> > > > query
> > > > > > >    processing. A query consists of multiple stages, which are
> > > > > > individually
> > > > > > >    processed in some order.
> > > > > > >       - For example, a simple aggregation query is executed in
> > two
> > > > > > stages,
> > > > > > >       each of which is for parallel aggregation and combining
> > > > > aggregates.
> > > > > > > These
> > > > > > >       stages are executed sequentially.
> > > > > > >    - A stage is concurrently processed by multiple tasks, and
> is
> > > > > executed
> > > > > > >    by TajoWorker.
> > > > > > >    - Each task contains meta information for input data and a
> > > > > LogicalPlan
> > > > > > >    of the stage. This LogicalPlan is translated into
> PhysicalExec
> > > by
> > > > > > >    PhysicalPlanner.
> > > > > > >    - PhysicalExec describes how the query is actually executed.
> > > > > > >       - For example, there are two types of AggregationExec,
> > > > > > >       i.e., HashAggregateExec and SortAggregateExec, for
> > hash-based
> > > > > > > aggregation
> > > > > > >       and sort-based aggregation, respectively.
> > > > > > >
> > > > > > > Best regards,
> > > > > > > Jihoon
> > > > > > >
> > > > > > > 2015년 6월 15일 (월) 오후 11:32, Atri Sharma <at...@gmail.com>님이
> > 작성:
> > > > > > >
> > > > > > > > Folks,
> > > > > > > >
> > > > > > > > I am looking into parallel aggregates/combining aggregates. I
> > > have
> > > > a
> > > > > > plan
> > > > > > > > around it which I think can work.
> > > > > > > >
> > > > > > > > Please update me on current infrastructure and point me
> around
> > > the
> > > > > > > existing
> > > > > > > > code base. Also, ideas would be most welcome around it.
> > > > > > > >
> > > > > > > > --
> > > > > > > > Regards,
> > > > > > > >
> > > > > > > > Atri
> > > > > > > > *l'apprenant*
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: Parallel Aggregates

Posted by Jihoon Son <ji...@apache.org>.

It looks good to start.
Any questions welcome!

Jihoon

2015년 6월 18일 (목) 오전 3:39, Atri Sharma <at...@gmail.com>님이 작성:

> So distinct aggregation is one area, thanks.
>
> I am trying to get enough knowledge of Internals of aggregation engine and
> query planner to be able to work on rollup and cube so picking smaller
> tickets first.
> On 18 Jun 2015 02:42, "Jihoon Son" <ji...@apache.org> wrote:
>
> > As far as I know, there aren't any plans for improvement except in
> distinct
> > aggregation. I think that our code for distinct aggregation is too
> > complicated, and the performance also should be improved.
> >
> > So, when you design the implementation of your algorithm on Tajo, you
> don't
> > have to consider distinct aggregation part, I think.
> >
> > 2015년 6월 18일 (목) 오전 2:16, Atri Sharma <at...@gmail.com>님이 작성:
> >
> > > Thank you.
> > >
> > > Is there any improvement in aggregates that we are looking at please?
> > > On 16 Jun 2015 17:07, "Jihoon Son" <ji...@apache.org> wrote:
> > >
> > > > In Tajo, aggregation is very similar to that in Hadoop MapReduce.
> > > > Let me consider an example. Given a query of "select *k*, count(*)
> from
> > > *t*
> > > > group by *k*", Tajo generates a LogicalPlan as follows.
> > > >
> > > > group by (k)
> > > >        |
> > > >    scan (t)
> > > >
> > > > This LogicalPlan is translated into a MasterPlan as follows.
> > > >
> > > > -----------------
> > > >      Stage2
> > > >   group by *k*
> > > > -----------------
> > > >           |
> > > > shuffle tuples with *k*
> > > >           |
> > > > -----------------
> > > >      Stage1
> > > >   group by *k*
> > > >          |
> > > >     scan *t*
> > > > -----------------
> > > >
> > > > As you can see in this example, the query plan consists of 2 stages.
> > Each
> > > > stage is executed subsequently because the result of Stage 1 is used
> as
> > > the
> > > > input of Stage 2. Each stage is divided into multiple tasks for each
> > > input
> > > > split as follows.
> > > >
> > > > Stage1
> > > >
> > > > Task 1
> > > > group by *k*
> > > >        |
> > > >   scan *t* (0 - 99)
> > > >
> > > > Task 2
> > > > group by *k*
> > > >        |
> > > >   scan *t* (100 - 199)
> > > > ...
> > > >
> > > > Each task is executed by a TajoWorker. As you can see, tasks of the
> > first
> > > > stage execute a local aggregation after scanning input split. This
> > local
> > > > aggregation result is shuffled among TajoWorkers with the aggregation
> > key
> > > > *k*. Then, the final aggregation is computed at the second stage.
> > > >
> > > > Stage1 and Stage2 are similar to Map and Reduce of MapReduce. The
> local
> > > > aggregation of Stage1 is similar to the Combiner of Hadoop MapReduce.
> > > >
> > > > I hope that this will be helpful to you.
> > > > If you have any further questions, please feel free to ask.
> > > > Jihoon
> > > >
> > > > 2015년 6월 16일 (화) 오전 7:28, Atri Sharma <at...@gmail.com>님이 작성:
> > > >
> > > > Thanks.
> > > > >
> > > > > What are your thoughts on parallel aggregation? Generating query
> > plans
> > > > that
> > > > > allow states to be generated which can be executed independently
> and
> > > then
> > > > > states recombined?
> > > > > On 16 Jun 2015 05:25, "Jihoon Son" <ji...@apache.org> wrote:
> > > > >
> > > > > > Hi Atri, thanks for your question.
> > > > > >
> > > > > > First of all, maybe you already did, I recommend that you read
> this
> > > > > article
> > > > > > <
> > > > > >
> > > > >
> > > >
> > >
> >
> http://www.hadoopsphere.com/2015/02/technical-deep-dive-into-apache-tajo.html
> > > > > > >
> > > > > > before you start implementation. This is written by Hyunsik, and
> > > > contains
> > > > > > the description of Tajo's overall infrastructure. Afterwards, I
> > think
> > > > > that
> > > > > > you may ask more detailed question.
> > > > > >
> > > > > > Here, I'll roughly list some important classes for aggregate
> > > > > > implementation.
> > > > > >
> > > > > >    - SQLParser.g4 contains our SQL parsing rules. It is written
> in
> > > > antlr.
> > > > > >    - SQLAnalyzer is our parser based on rules defined at
> > > SQLParser.g4.
> > > > > >    - SQLAnalyzer translates a SQL query into a tree of Expr which
> > > > > >    represents an algebraic expression.
> > > > > >    - LogicalPlanner translates the Expr tree into a LogicalPlan
> > that
> > > > > >    logically describes how the given query will be executed.
> > > > > >    - GlobalPlanner translates the LogicalPlan into a MasterPlan
> > > > > >    (distributed query execution plan) that describes how the
> given
> > > > query
> > > > > > will
> > > > > >    be executed in distributed cluster.
> > > > > >    - Once a MasterPlan is created, QueryMaster starts to execute
> > > query
> > > > > >    processing. A query consists of multiple stages, which are
> > > > > individually
> > > > > >    processed in some order.
> > > > > >       - For example, a simple aggregation query is executed in
> two
> > > > > stages,
> > > > > >       each of which is for parallel aggregation and combining
> > > > aggregates.
> > > > > > These
> > > > > >       stages are executed sequentially.
> > > > > >    - A stage is concurrently processed by multiple tasks, and is
> > > > executed
> > > > > >    by TajoWorker.
> > > > > >    - Each task contains meta information for input data and a
> > > > LogicalPlan
> > > > > >    of the stage. This LogicalPlan is translated into PhysicalExec
> > by
> > > > > >    PhysicalPlanner.
> > > > > >    - PhysicalExec describes how the query is actually executed.
> > > > > >       - For example, there are two types of AggregationExec,
> > > > > >       i.e., HashAggregateExec and SortAggregateExec, for
> hash-based
> > > > > > aggregation
> > > > > >       and sort-based aggregation, respectively.
> > > > > >
> > > > > > Best regards,
> > > > > > Jihoon
> > > > > >
> > > > > > 2015년 6월 15일 (월) 오후 11:32, Atri Sharma <at...@gmail.com>님이
> 작성:
> > > > > >
> > > > > > > Folks,
> > > > > > >
> > > > > > > I am looking into parallel aggregates/combining aggregates. I
> > have
> > > a
> > > > > plan
> > > > > > > around it which I think can work.
> > > > > > >
> > > > > > > Please update me on current infrastructure and point me around
> > the
> > > > > > existing
> > > > > > > code base. Also, ideas would be most welcome around it.
> > > > > > >
> > > > > > > --
> > > > > > > Regards,
> > > > > > >
> > > > > > > Atri
> > > > > > > *l'apprenant*
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: Parallel Aggregates

Posted by Atri Sharma <at...@gmail.com>.

So distinct aggregation is one area, thanks.

I am trying to get enough knowledge of Internals of aggregation engine and
query planner to be able to work on rollup and cube so picking smaller
tickets first.
On 18 Jun 2015 02:42, "Jihoon Son" <ji...@apache.org> wrote:

> As far as I know, there aren't any plans for improvement except in distinct
> aggregation. I think that our code for distinct aggregation is too
> complicated, and the performance also should be improved.
>
> So, when you design the implementation of your algorithm on Tajo, you don't
> have to consider distinct aggregation part, I think.
>
> 2015년 6월 18일 (목) 오전 2:16, Atri Sharma <at...@gmail.com>님이 작성:
>
> > Thank you.
> >
> > Is there any improvement in aggregates that we are looking at please?
> > On 16 Jun 2015 17:07, "Jihoon Son" <ji...@apache.org> wrote:
> >
> > > In Tajo, aggregation is very similar to that in Hadoop MapReduce.
> > > Let me consider an example. Given a query of "select *k*, count(*) from
> > *t*
> > > group by *k*", Tajo generates a LogicalPlan as follows.
> > >
> > > group by (k)
> > >        |
> > >    scan (t)
> > >
> > > This LogicalPlan is translated into a MasterPlan as follows.
> > >
> > > -----------------
> > >      Stage2
> > >   group by *k*
> > > -----------------
> > >           |
> > > shuffle tuples with *k*
> > >           |
> > > -----------------
> > >      Stage1
> > >   group by *k*
> > >          |
> > >     scan *t*
> > > -----------------
> > >
> > > As you can see in this example, the query plan consists of 2 stages.
> Each
> > > stage is executed subsequently because the result of Stage 1 is used as
> > the
> > > input of Stage 2. Each stage is divided into multiple tasks for each
> > input
> > > split as follows.
> > >
> > > Stage1
> > >
> > > Task 1
> > > group by *k*
> > >        |
> > >   scan *t* (0 - 99)
> > >
> > > Task 2
> > > group by *k*
> > >        |
> > >   scan *t* (100 - 199)
> > > ...
> > >
> > > Each task is executed by a TajoWorker. As you can see, tasks of the
> first
> > > stage execute a local aggregation after scanning input split. This
> local
> > > aggregation result is shuffled among TajoWorkers with the aggregation
> key
> > > *k*. Then, the final aggregation is computed at the second stage.
> > >
> > > Stage1 and Stage2 are similar to Map and Reduce of MapReduce. The local
> > > aggregation of Stage1 is similar to the Combiner of Hadoop MapReduce.
> > >
> > > I hope that this will be helpful to you.
> > > If you have any further questions, please feel free to ask.
> > > Jihoon
> > >
> > > 2015년 6월 16일 (화) 오전 7:28, Atri Sharma <at...@gmail.com>님이 작성:
> > >
> > > Thanks.
> > > >
> > > > What are your thoughts on parallel aggregation? Generating query
> plans
> > > that
> > > > allow states to be generated which can be executed independently and
> > then
> > > > states recombined?
> > > > On 16 Jun 2015 05:25, "Jihoon Son" <ji...@apache.org> wrote:
> > > >
> > > > > Hi Atri, thanks for your question.
> > > > >
> > > > > First of all, maybe you already did, I recommend that you read this
> > > > article
> > > > > <
> > > > >
> > > >
> > >
> >
> http://www.hadoopsphere.com/2015/02/technical-deep-dive-into-apache-tajo.html
> > > > > >
> > > > > before you start implementation. This is written by Hyunsik, and
> > > contains
> > > > > the description of Tajo's overall infrastructure. Afterwards, I
> think
> > > > that
> > > > > you may ask more detailed question.
> > > > >
> > > > > Here, I'll roughly list some important classes for aggregate
> > > > > implementation.
> > > > >
> > > > >    - SQLParser.g4 contains our SQL parsing rules. It is written in
> > > antlr.
> > > > >    - SQLAnalyzer is our parser based on rules defined at
> > SQLParser.g4.
> > > > >    - SQLAnalyzer translates a SQL query into a tree of Expr which
> > > > >    represents an algebraic expression.
> > > > >    - LogicalPlanner translates the Expr tree into a LogicalPlan
> that
> > > > >    logically describes how the given query will be executed.
> > > > >    - GlobalPlanner translates the LogicalPlan into a MasterPlan
> > > > >    (distributed query execution plan) that describes how the given
> > > query
> > > > > will
> > > > >    be executed in distributed cluster.
> > > > >    - Once a MasterPlan is created, QueryMaster starts to execute
> > query
> > > > >    processing. A query consists of multiple stages, which are
> > > > individually
> > > > >    processed in some order.
> > > > >       - For example, a simple aggregation query is executed in two
> > > > stages,
> > > > >       each of which is for parallel aggregation and combining
> > > aggregates.
> > > > > These
> > > > >       stages are executed sequentially.
> > > > >    - A stage is concurrently processed by multiple tasks, and is
> > > executed
> > > > >    by TajoWorker.
> > > > >    - Each task contains meta information for input data and a
> > > LogicalPlan
> > > > >    of the stage. This LogicalPlan is translated into PhysicalExec
> by
> > > > >    PhysicalPlanner.
> > > > >    - PhysicalExec describes how the query is actually executed.
> > > > >       - For example, there are two types of AggregationExec,
> > > > >       i.e., HashAggregateExec and SortAggregateExec, for hash-based
> > > > > aggregation
> > > > >       and sort-based aggregation, respectively.
> > > > >
> > > > > Best regards,
> > > > > Jihoon
> > > > >
> > > > > 2015년 6월 15일 (월) 오후 11:32, Atri Sharma <at...@gmail.com>님이 작성:
> > > > >
> > > > > > Folks,
> > > > > >
> > > > > > I am looking into parallel aggregates/combining aggregates. I
> have
> > a
> > > > plan
> > > > > > around it which I think can work.
> > > > > >
> > > > > > Please update me on current infrastructure and point me around
> the
> > > > > existing
> > > > > > code base. Also, ideas would be most welcome around it.
> > > > > >
> > > > > > --
> > > > > > Regards,
> > > > > >
> > > > > > Atri
> > > > > > *l'apprenant*
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: Parallel Aggregates

Posted by Jihoon Son <ji...@apache.org>.

As far as I know, there aren't any plans for improvement except in distinct
aggregation. I think that our code for distinct aggregation is too
complicated, and the performance also should be improved.

So, when you design the implementation of your algorithm on Tajo, you don't
have to consider distinct aggregation part, I think.

2015년 6월 18일 (목) 오전 2:16, Atri Sharma <at...@gmail.com>님이 작성:

> Thank you.
>
> Is there any improvement in aggregates that we are looking at please?
> On 16 Jun 2015 17:07, "Jihoon Son" <ji...@apache.org> wrote:
>
> > In Tajo, aggregation is very similar to that in Hadoop MapReduce.
> > Let me consider an example. Given a query of "select *k*, count(*) from
> *t*
> > group by *k*", Tajo generates a LogicalPlan as follows.
> >
> > group by (k)
> >        |
> >    scan (t)
> >
> > This LogicalPlan is translated into a MasterPlan as follows.
> >
> > -----------------
> >      Stage2
> >   group by *k*
> > -----------------
> >           |
> > shuffle tuples with *k*
> >           |
> > -----------------
> >      Stage1
> >   group by *k*
> >          |
> >     scan *t*
> > -----------------
> >
> > As you can see in this example, the query plan consists of 2 stages. Each
> > stage is executed subsequently because the result of Stage 1 is used as
> the
> > input of Stage 2. Each stage is divided into multiple tasks for each
> input
> > split as follows.
> >
> > Stage1
> >
> > Task 1
> > group by *k*
> >        |
> >   scan *t* (0 - 99)
> >
> > Task 2
> > group by *k*
> >        |
> >   scan *t* (100 - 199)
> > ...
> >
> > Each task is executed by a TajoWorker. As you can see, tasks of the first
> > stage execute a local aggregation after scanning input split. This local
> > aggregation result is shuffled among TajoWorkers with the aggregation key
> > *k*. Then, the final aggregation is computed at the second stage.
> >
> > Stage1 and Stage2 are similar to Map and Reduce of MapReduce. The local
> > aggregation of Stage1 is similar to the Combiner of Hadoop MapReduce.
> >
> > I hope that this will be helpful to you.
> > If you have any further questions, please feel free to ask.
> > Jihoon
> >
> > 2015년 6월 16일 (화) 오전 7:28, Atri Sharma <at...@gmail.com>님이 작성:
> >
> > Thanks.
> > >
> > > What are your thoughts on parallel aggregation? Generating query plans
> > that
> > > allow states to be generated which can be executed independently and
> then
> > > states recombined?
> > > On 16 Jun 2015 05:25, "Jihoon Son" <ji...@apache.org> wrote:
> > >
> > > > Hi Atri, thanks for your question.
> > > >
> > > > First of all, maybe you already did, I recommend that you read this
> > > article
> > > > <
> > > >
> > >
> >
> http://www.hadoopsphere.com/2015/02/technical-deep-dive-into-apache-tajo.html
> > > > >
> > > > before you start implementation. This is written by Hyunsik, and
> > contains
> > > > the description of Tajo's overall infrastructure. Afterwards, I think
> > > that
> > > > you may ask more detailed question.
> > > >
> > > > Here, I'll roughly list some important classes for aggregate
> > > > implementation.
> > > >
> > > >    - SQLParser.g4 contains our SQL parsing rules. It is written in
> > antlr.
> > > >    - SQLAnalyzer is our parser based on rules defined at
> SQLParser.g4.
> > > >    - SQLAnalyzer translates a SQL query into a tree of Expr which
> > > >    represents an algebraic expression.
> > > >    - LogicalPlanner translates the Expr tree into a LogicalPlan that
> > > >    logically describes how the given query will be executed.
> > > >    - GlobalPlanner translates the LogicalPlan into a MasterPlan
> > > >    (distributed query execution plan) that describes how the given
> > query
> > > > will
> > > >    be executed in distributed cluster.
> > > >    - Once a MasterPlan is created, QueryMaster starts to execute
> query
> > > >    processing. A query consists of multiple stages, which are
> > > individually
> > > >    processed in some order.
> > > >       - For example, a simple aggregation query is executed in two
> > > stages,
> > > >       each of which is for parallel aggregation and combining
> > aggregates.
> > > > These
> > > >       stages are executed sequentially.
> > > >    - A stage is concurrently processed by multiple tasks, and is
> > executed
> > > >    by TajoWorker.
> > > >    - Each task contains meta information for input data and a
> > LogicalPlan
> > > >    of the stage. This LogicalPlan is translated into PhysicalExec by
> > > >    PhysicalPlanner.
> > > >    - PhysicalExec describes how the query is actually executed.
> > > >       - For example, there are two types of AggregationExec,
> > > >       i.e., HashAggregateExec and SortAggregateExec, for hash-based
> > > > aggregation
> > > >       and sort-based aggregation, respectively.
> > > >
> > > > Best regards,
> > > > Jihoon
> > > >
> > > > 2015년 6월 15일 (월) 오후 11:32, Atri Sharma <at...@gmail.com>님이 작성:
> > > >
> > > > > Folks,
> > > > >
> > > > > I am looking into parallel aggregates/combining aggregates. I have
> a
> > > plan
> > > > > around it which I think can work.
> > > > >
> > > > > Please update me on current infrastructure and point me around the
> > > > existing
> > > > > code base. Also, ideas would be most welcome around it.
> > > > >
> > > > > --
> > > > > Regards,
> > > > >
> > > > > Atri
> > > > > *l'apprenant*
> > > > >
> > > >
> > >
> >
>

Re: Parallel Aggregates

Posted by Atri Sharma <at...@gmail.com>.

Thank you.

Is there any improvement in aggregates that we are looking at please?
On 16 Jun 2015 17:07, "Jihoon Son" <ji...@apache.org> wrote:

> In Tajo, aggregation is very similar to that in Hadoop MapReduce.
> Let me consider an example. Given a query of "select *k*, count(*) from *t*
> group by *k*", Tajo generates a LogicalPlan as follows.
>
> group by (k)
>        |
>    scan (t)
>
> This LogicalPlan is translated into a MasterPlan as follows.
>
> -----------------
>      Stage2
>   group by *k*
> -----------------
>           |
> shuffle tuples with *k*
>           |
> -----------------
>      Stage1
>   group by *k*
>          |
>     scan *t*
> -----------------
>
> As you can see in this example, the query plan consists of 2 stages. Each
> stage is executed subsequently because the result of Stage 1 is used as the
> input of Stage 2. Each stage is divided into multiple tasks for each input
> split as follows.
>
> Stage1
>
> Task 1
> group by *k*
>        |
>   scan *t* (0 - 99)
>
> Task 2
> group by *k*
>        |
>   scan *t* (100 - 199)
> ...
>
> Each task is executed by a TajoWorker. As you can see, tasks of the first
> stage execute a local aggregation after scanning input split. This local
> aggregation result is shuffled among TajoWorkers with the aggregation key
> *k*. Then, the final aggregation is computed at the second stage.
>
> Stage1 and Stage2 are similar to Map and Reduce of MapReduce. The local
> aggregation of Stage1 is similar to the Combiner of Hadoop MapReduce.
>
> I hope that this will be helpful to you.
> If you have any further questions, please feel free to ask.
> Jihoon
>
> 2015년 6월 16일 (화) 오전 7:28, Atri Sharma <at...@gmail.com>님이 작성:
>
> Thanks.
> >
> > What are your thoughts on parallel aggregation? Generating query plans
> that
> > allow states to be generated which can be executed independently and then
> > states recombined?
> > On 16 Jun 2015 05:25, "Jihoon Son" <ji...@apache.org> wrote:
> >
> > > Hi Atri, thanks for your question.
> > >
> > > First of all, maybe you already did, I recommend that you read this
> > article
> > > <
> > >
> >
> http://www.hadoopsphere.com/2015/02/technical-deep-dive-into-apache-tajo.html
> > > >
> > > before you start implementation. This is written by Hyunsik, and
> contains
> > > the description of Tajo's overall infrastructure. Afterwards, I think
> > that
> > > you may ask more detailed question.
> > >
> > > Here, I'll roughly list some important classes for aggregate
> > > implementation.
> > >
> > >    - SQLParser.g4 contains our SQL parsing rules. It is written in
> antlr.
> > >    - SQLAnalyzer is our parser based on rules defined at SQLParser.g4.
> > >    - SQLAnalyzer translates a SQL query into a tree of Expr which
> > >    represents an algebraic expression.
> > >    - LogicalPlanner translates the Expr tree into a LogicalPlan that
> > >    logically describes how the given query will be executed.
> > >    - GlobalPlanner translates the LogicalPlan into a MasterPlan
> > >    (distributed query execution plan) that describes how the given
> query
> > > will
> > >    be executed in distributed cluster.
> > >    - Once a MasterPlan is created, QueryMaster starts to execute query
> > >    processing. A query consists of multiple stages, which are
> > individually
> > >    processed in some order.
> > >       - For example, a simple aggregation query is executed in two
> > stages,
> > >       each of which is for parallel aggregation and combining
> aggregates.
> > > These
> > >       stages are executed sequentially.
> > >    - A stage is concurrently processed by multiple tasks, and is
> executed
> > >    by TajoWorker.
> > >    - Each task contains meta information for input data and a
> LogicalPlan
> > >    of the stage. This LogicalPlan is translated into PhysicalExec by
> > >    PhysicalPlanner.
> > >    - PhysicalExec describes how the query is actually executed.
> > >       - For example, there are two types of AggregationExec,
> > >       i.e., HashAggregateExec and SortAggregateExec, for hash-based
> > > aggregation
> > >       and sort-based aggregation, respectively.
> > >
> > > Best regards,
> > > Jihoon
> > >
> > > 2015년 6월 15일 (월) 오후 11:32, Atri Sharma <at...@gmail.com>님이 작성:
> > >
> > > > Folks,
> > > >
> > > > I am looking into parallel aggregates/combining aggregates. I have a
> > plan
> > > > around it which I think can work.
> > > >
> > > > Please update me on current infrastructure and point me around the
> > > existing
> > > > code base. Also, ideas would be most welcome around it.
> > > >
> > > > --
> > > > Regards,
> > > >
> > > > Atri
> > > > *l'apprenant*
> > > >
> > >
> >
>

Re: Parallel Aggregates

Posted by Jihoon Son <ji...@apache.org>.

In Tajo, aggregation is very similar to that in Hadoop MapReduce.
Let me consider an example. Given a query of "select *k*, count(*) from *t*
group by *k*", Tajo generates a LogicalPlan as follows.

group by (k)
       |
   scan (t)

This LogicalPlan is translated into a MasterPlan as follows.

-----------------
     Stage2
  group by *k*
-----------------
          |
shuffle tuples with *k*
          |
-----------------
     Stage1
  group by *k*
         |
    scan *t*
-----------------

As you can see in this example, the query plan consists of 2 stages. Each
stage is executed subsequently because the result of Stage 1 is used as the
input of Stage 2. Each stage is divided into multiple tasks for each input
split as follows.

Stage1

Task 1
group by *k*
       |
  scan *t* (0 - 99)

Task 2
group by *k*
       |
  scan *t* (100 - 199)
...

Each task is executed by a TajoWorker. As you can see, tasks of the first
stage execute a local aggregation after scanning input split. This local
aggregation result is shuffled among TajoWorkers with the aggregation key
*k*. Then, the final aggregation is computed at the second stage.

Stage1 and Stage2 are similar to Map and Reduce of MapReduce. The local
aggregation of Stage1 is similar to the Combiner of Hadoop MapReduce.

I hope that this will be helpful to you.
If you have any further questions, please feel free to ask.
Jihoon

2015년 6월 16일 (화) 오전 7:28, Atri Sharma <at...@gmail.com>님이 작성:

Thanks.
>
> What are your thoughts on parallel aggregation? Generating query plans that
> allow states to be generated which can be executed independently and then
> states recombined?
> On 16 Jun 2015 05:25, "Jihoon Son" <ji...@apache.org> wrote:
>
> > Hi Atri, thanks for your question.
> >
> > First of all, maybe you already did, I recommend that you read this
> article
> > <
> >
> http://www.hadoopsphere.com/2015/02/technical-deep-dive-into-apache-tajo.html
> > >
> > before you start implementation. This is written by Hyunsik, and contains
> > the description of Tajo's overall infrastructure. Afterwards, I think
> that
> > you may ask more detailed question.
> >
> > Here, I'll roughly list some important classes for aggregate
> > implementation.
> >
> >    - SQLParser.g4 contains our SQL parsing rules. It is written in antlr.
> >    - SQLAnalyzer is our parser based on rules defined at SQLParser.g4.
> >    - SQLAnalyzer translates a SQL query into a tree of Expr which
> >    represents an algebraic expression.
> >    - LogicalPlanner translates the Expr tree into a LogicalPlan that
> >    logically describes how the given query will be executed.
> >    - GlobalPlanner translates the LogicalPlan into a MasterPlan
> >    (distributed query execution plan) that describes how the given query
> > will
> >    be executed in distributed cluster.
> >    - Once a MasterPlan is created, QueryMaster starts to execute query
> >    processing. A query consists of multiple stages, which are
> individually
> >    processed in some order.
> >       - For example, a simple aggregation query is executed in two
> stages,
> >       each of which is for parallel aggregation and combining aggregates.
> > These
> >       stages are executed sequentially.
> >    - A stage is concurrently processed by multiple tasks, and is executed
> >    by TajoWorker.
> >    - Each task contains meta information for input data and a LogicalPlan
> >    of the stage. This LogicalPlan is translated into PhysicalExec by
> >    PhysicalPlanner.
> >    - PhysicalExec describes how the query is actually executed.
> >       - For example, there are two types of AggregationExec,
> >       i.e., HashAggregateExec and SortAggregateExec, for hash-based
> > aggregation
> >       and sort-based aggregation, respectively.
> >
> > Best regards,
> > Jihoon
> >
> > 2015년 6월 15일 (월) 오후 11:32, Atri Sharma <at...@gmail.com>님이 작성:
> >
> > > Folks,
> > >
> > > I am looking into parallel aggregates/combining aggregates. I have a
> plan
> > > around it which I think can work.
> > >
> > > Please update me on current infrastructure and point me around the
> > existing
> > > code base. Also, ideas would be most welcome around it.
> > >
> > > --
> > > Regards,
> > >
> > > Atri
> > > *l'apprenant*
> > >
> >
>

Re: Parallel Aggregates

Posted by Atri Sharma <at...@gmail.com>.

Thanks.

What are your thoughts on parallel aggregation? Generating query plans that
allow states to be generated which can be executed independently and then
states recombined?
On 16 Jun 2015 05:25, "Jihoon Son" <ji...@apache.org> wrote:

> Hi Atri, thanks for your question.
>
> First of all, maybe you already did, I recommend that you read this article
> <
> http://www.hadoopsphere.com/2015/02/technical-deep-dive-into-apache-tajo.html
> >
> before you start implementation. This is written by Hyunsik, and contains
> the description of Tajo's overall infrastructure. Afterwards, I think that
> you may ask more detailed question.
>
> Here, I'll roughly list some important classes for aggregate
> implementation.
>
>    - SQLParser.g4 contains our SQL parsing rules. It is written in antlr.
>    - SQLAnalyzer is our parser based on rules defined at SQLParser.g4.
>    - SQLAnalyzer translates a SQL query into a tree of Expr which
>    represents an algebraic expression.
>    - LogicalPlanner translates the Expr tree into a LogicalPlan that
>    logically describes how the given query will be executed.
>    - GlobalPlanner translates the LogicalPlan into a MasterPlan
>    (distributed query execution plan) that describes how the given query
> will
>    be executed in distributed cluster.
>    - Once a MasterPlan is created, QueryMaster starts to execute query
>    processing. A query consists of multiple stages, which are individually
>    processed in some order.
>       - For example, a simple aggregation query is executed in two stages,
>       each of which is for parallel aggregation and combining aggregates.
> These
>       stages are executed sequentially.
>    - A stage is concurrently processed by multiple tasks, and is executed
>    by TajoWorker.
>    - Each task contains meta information for input data and a LogicalPlan
>    of the stage. This LogicalPlan is translated into PhysicalExec by
>    PhysicalPlanner.
>    - PhysicalExec describes how the query is actually executed.
>       - For example, there are two types of AggregationExec,
>       i.e., HashAggregateExec and SortAggregateExec, for hash-based
> aggregation
>       and sort-based aggregation, respectively.
>
> Best regards,
> Jihoon
>
> 2015년 6월 15일 (월) 오후 11:32, Atri Sharma <at...@gmail.com>님이 작성:
>
> > Folks,
> >
> > I am looking into parallel aggregates/combining aggregates. I have a plan
> > around it which I think can work.
> >
> > Please update me on current infrastructure and point me around the
> existing
> > code base. Also, ideas would be most welcome around it.
> >
> > --
> > Regards,
> >
> > Atri
> > *l'apprenant*
> >
>

Re: Parallel Aggregates

Posted by Jihoon Son <ji...@apache.org>.

Hi Atri, thanks for your question.

First of all, maybe you already did, I recommend that you read this article
<http://www.hadoopsphere.com/2015/02/technical-deep-dive-into-apache-tajo.html>
before you start implementation. This is written by Hyunsik, and contains
the description of Tajo's overall infrastructure. Afterwards, I think that
you may ask more detailed question.

Here, I'll roughly list some important classes for aggregate implementation.

   - SQLParser.g4 contains our SQL parsing rules. It is written in antlr.
   - SQLAnalyzer is our parser based on rules defined at SQLParser.g4.
   - SQLAnalyzer translates a SQL query into a tree of Expr which
   represents an algebraic expression.
   - LogicalPlanner translates the Expr tree into a LogicalPlan that
   logically describes how the given query will be executed.
   - GlobalPlanner translates the LogicalPlan into a MasterPlan
   (distributed query execution plan) that describes how the given query will
   be executed in distributed cluster.
   - Once a MasterPlan is created, QueryMaster starts to execute query
   processing. A query consists of multiple stages, which are individually
   processed in some order.
      - For example, a simple aggregation query is executed in two stages,
      each of which is for parallel aggregation and combining aggregates. These
      stages are executed sequentially.
   - A stage is concurrently processed by multiple tasks, and is executed
   by TajoWorker.
   - Each task contains meta information for input data and a LogicalPlan
   of the stage. This LogicalPlan is translated into PhysicalExec by
   PhysicalPlanner.
   - PhysicalExec describes how the query is actually executed.
      - For example, there are two types of AggregationExec,
      i.e., HashAggregateExec and SortAggregateExec, for hash-based aggregation
      and sort-based aggregation, respectively.

Best regards,
Jihoon

2015년 6월 15일 (월) 오후 11:32, Atri Sharma <at...@gmail.com>님이 작성:

> Folks,
>
> I am looking into parallel aggregates/combining aggregates. I have a plan
> around it which I think can work.
>
> Please update me on current infrastructure and point me around the existing
> code base. Also, ideas would be most welcome around it.
>
> --
> Regards,
>
> Atri
> *l'apprenant*
>