Posted to dev@drill.apache.org by Abdel Hakim Deneche <ad...@maprtech.com> on 2015/06/12 00:55:04 UTC

[DISCUSSION] How can we improve the performance of Window Functions

Hi all,

The purpose of this email is to describe how window functions are computed
and to try to come up with "better" ways to do it.

DRILL-3200 <https://issues.apache.org/jira/browse/DRILL-3200> added support
for RANK, ROW_NUMBER, DENSE_RANK, PERCENT_RANK and CUME_DIST, but also made
some significant improvements to the way Drill computes window functions.

The general idea was to update the code to only support the default frame,
which makes it run faster and use less memory.

WindowFrameRecordBatch works similarly to StreamingAggregate: it requires
the data to be sorted on the partition and order-by columns, and it only
computes one frame at a time. With the default frame we only need to
aggregate every row once.
Memory consumption depends on the data, but in general each record batch is
kept in memory until we are ready to process all of its rows (which is
possible once we find the last peer row of the batch's last row). Drill's
external sort can spill to disk if the data is too big, and we only need to
keep at most one partition's worth of data in memory for the window
functions to be computed (when the OVER clause doesn't contain an ORDER BY).
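
To make the last-peer condition concrete, here is a minimal sketch of the
buffering check in Java. It assumes rows can be compared on the partition
and order-by keys; the names (canProcess, peerOrder, bufferedBatches) are
illustrative, not Drill's actual code:

import java.util.Comparator;
import java.util.List;

// Illustrative sketch only: a buffered batch can be processed once some
// later buffered row is no longer a peer of its last row (i.e. it differs
// on the partition or order-by keys), or once the upstream is exhausted.
static <R> boolean canProcess(int batchIndex,
                              List<List<R>> bufferedBatches,
                              boolean upstreamDone,
                              Comparator<R> peerOrder) {
  List<R> batch = bufferedBatches.get(batchIndex);
  R lastRow = batch.get(batch.size() - 1);
  for (int i = batchIndex + 1; i < bufferedBatches.size(); i++) {
    for (R row : bufferedBatches.get(i)) {
      if (peerOrder.compare(lastRow, row) != 0) {
        return true;   // found a non-peer row, so the last peer is known
      }
    }
  }
  return upstreamDone; // no non-peer row seen yet: only safe if input is done
}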

Each time a batch is ready to be processed we do the following (a rough
sketch in code follows the list):

1- we start with its first row (the current row)
2- we compute the length of the current row's frame (in this case we find
the number of peer rows of the current row),
3- we aggregate all rows of the current frame (this includes computing the
window function values),
4- we write the aggregated value into each row of the current frame,
5- we then move to the first non-peer row, which becomes the current row,
6- if we didn't reach the end of the current batch, go back to 2
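
Here is that loop as a minimal sketch in Java. It assumes a frame never
crosses a batch boundary (which the real operator does have to handle) and
uses a made-up Aggregator interface; none of these names come from the
actual generated code:

import java.util.Comparator;
import java.util.List;

// Hypothetical aggregator used by the sketch below.
interface Aggregator<R> {
  void reset();
  void add(R row);
  void writeTo(R row);   // write the current aggregated value into a row
}

// Illustrative sketch of the default-frame algorithm described above: one
// aggregated value per group of peer rows, written back to every row of
// the frame, so each row is aggregated exactly once.
static <R> void processBatch(List<R> batch,
                             Comparator<R> peerOrder,
                             Aggregator<R> agg) {
  int current = 0;                                   // step 1: first row
  while (current < batch.size()) {
    // step 2: frame length = number of peer rows of the current row
    int frameEnd = current + 1;
    while (frameEnd < batch.size()
        && peerOrder.compare(batch.get(current), batch.get(frameEnd)) == 0) {
      frameEnd++;
    }
    // step 3: aggregate every row of the current frame
    agg.reset();
    for (int i = current; i < frameEnd; i++) {
      agg.add(batch.get(i));
    }
    // step 4: write the aggregated value into each row of the frame
    for (int i = current; i < frameEnd; i++) {
      agg.writeTo(batch.get(i));
    }
    // steps 5 and 6: the first non-peer row becomes the current row
    current = frameEnd;
  }
}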

With all this in mind, how can we improve the performance of window
functions?

Thanks!
-- 

Abdelhakim Deneche

Software Engineer

  <http://www.mapr.com/>



Re: [DISCUSSION] How can we improve the performance of Window Functions

Posted by Abdel Hakim Deneche <ad...@maprtech.com>.
Thank you Ted, I will see if I'm doing something similar and try to improve
it.

Steven, you are right, it looks like the sort is taking most of the
computation time.


-- 

Abdelhakim Deneche

Software Engineer

  <http://www.mapr.com/>



Re: [DISCUSSION] How can we improve the performance of Window Functions

Posted by Steven Phillips <sp...@maprtech.com>.
Can you give us some data on what the current performance looks like vs.
what you would expect? Are we spending most of the time in the sort, or in
the Window function operator?

-- 
 Steven Phillips
 Software Engineer

 mapr.com

Re: [DISCUSSION] How can we improve the performance of Window Functions

Posted by Ted Dunning <te...@gmail.com>.
Speed in many such loops depends a lot on how the loops are ordered so that
cache and registers can be re-used.  I have no idea what will make your
windowing functions fast, but I can say some things about what makes matrix
math fast.

The key with matrix multiplication is that there are roughly n^3 operations
to do on n^2 elements.  The minimum number of memory operations is n^2,
which sounds good because modern CPUs can often do hundreds of operations
per main memory access.  This also means that if we code a naive
implementation, we will generally be memory bound because that will
increase the number of memory operations to k * n^3.

To avoid that, the loops involved can be restructured so that larger and
larger blocks of data are used.  At the lowest levels, small blocks of 2 x
4 values or so are used to code the multiplication, since all of these
values can be kept in registers.  At one step up, the computation is
structured to only operate on elements that fit in the fastest level of
cache, which is typically tens of kB in size.
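
As a toy illustration of that blocking idea (not a tuned kernel, and
assuming the result array starts zeroed), a blocked multiply over row-major
n x n matrices stored in flat arrays might look like this:

// Toy illustration of cache blocking for C += A * B (row-major, n x n,
// c assumed zero-initialized).  BLOCK is chosen so a few BLOCK x BLOCK
// tiles fit in the fastest level of cache.
static void blockedMultiply(double[] a, double[] b, double[] c, int n) {
  final int BLOCK = 64;
  for (int ii = 0; ii < n; ii += BLOCK) {
    for (int kk = 0; kk < n; kk += BLOCK) {
      for (int jj = 0; jj < n; jj += BLOCK) {
        // work on one tile at a time so its values stay cache-resident
        int iMax = Math.min(ii + BLOCK, n);
        int kMax = Math.min(kk + BLOCK, n);
        int jMax = Math.min(jj + BLOCK, n);
        for (int i = ii; i < iMax; i++) {
          for (int k = kk; k < kMax; k++) {
            double aik = a[i * n + k];          // stays in a register
            for (int j = jj; j < jMax; j++) {
              c[i * n + j] += aik * b[k * n + j];
            }
          }
        }
      }
    }
  }
}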

Your loop looks like this:

for (int start = 0; start <= end - n; start++) {
   initialize();                  // reset the aggregator for this frame
   for (int offset = 0; offset < n; offset++) {
      aggregate(start + offset);  // feed each row of the frame to the aggregator
   }
   finalize();                    // emit the value for this frame
}

This arrangement is pretty cache friendly if n is small enough, but it
seems that it could be even more friendly if you kept all of the
aggregators at the ready and handed each sample to all of the aggregators
before moving to the next position.
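
Concretely, the restructured loop might look something like this sketch
(hypothetical names; Aggregator is just an imaginary interface with
reset/add, not anything in Drill):

import java.util.List;

interface Aggregator<R> {
  void reset();
  void add(R row);
}

// Sketch of the restructuring: keep every aggregator live and hand each
// row to all of them before advancing, so each row is read from memory
// once while it is still hot in cache.
static <R> void aggregateFrame(List<R> frame, List<Aggregator<R>> aggregators) {
  for (Aggregator<R> agg : aggregators) {
    agg.reset();
  }
  for (R row : frame) {                    // single pass over the data
    for (Aggregator<R> agg : aggregators) {
      agg.add(row);                        // row stays cache-resident for all aggregators
    }
  }
}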
