Posted to dev@mahout.apache.org by Ted Dunning <te...@gmail.com> on 2014/06/10 00:39:27 UTC

Talking about optimization

So D's disparagement of my half-assed fix for our egregious
mis-performance has led me to thinking about what a real optimizer would
do.

I think we all agree that we are attempting to maximize performance for a
flow of operations given pre-existing layouts.  Transformations include
combining and rearranging operations and using sparse iterations where
applicable.  I expect that optimizing for intermediate memory use will
happen as a side effect.

As I see it, we have some properties of the matrices and vectors and
functions that are our inputs that we could use.  We also have a relatively
small set of higher order functions that include the primitives that we
already have and a few extras.

These include at least the following; a rough sketch of how they might be
surfaced in code follows the operator list.

*Vector qualities*

- isDense
- sparsity (fraction of non-zeros)
- skewedSparse (+1 if the non-zeros are near the beginning of a vector, -1
if near the end, 0 if roughly symmetric)
- size
- mergeUpdate (true if should update by collecting updates and merging back)
- randomAccessCost
- sequentialAccessCost
- randomUpdateCost

*Matrix qualities*

- rowMajor (true if adjacent elements in rows are near in memory)
- columnMajor (true if adjacent elements in columns are near in memory)
- randomOrder (true if adjacent elements are not near in memory)
- sequentialAccessCost (estimated cost of access in major order including
some estimate of caching effects)
- randomAccessCost
- mergeUpdate

*Functional properties*

(all as supported in Mahout now)

*Primitive logical operators*

(many of these are already supported in Mahout.  Some additions would be
required.)

- applyElementWise (apply a unary function to a matrix or vector)
- applyElementWise (apply a binary function to two matrices or two vectors)
- applyElementsToRows (apply a binary function to elements of a vector and
corresponding rows of a matrix)
- applyElementsToColumns (apply a binary function to elements of a vector
and corresponding columns of a matrix)
- outerProduct (apply a binary function to all combinations of elements of
two vectors)
- applyRowWise (apply a vector function per row to get rows of a result
matrix)
- applyRowWise (apply a vector function per row to get values of a result
vector)
- applyRowWise (apply a vector function per row to get a list of results)
- applyColumnWise (apply a vector function per column to get columns of a
result matrix)
- applyColumnWise (apply a vector function per column to get values of a
result vector)
- applyColumnWise (apply a vector function per column to get a list of results)
- generalMatrixProduct (apply matrix product defined by two functions to
two matrices)
- generalMatrixProduct (apply matrix product defined by two functions to a
matrix and a vector)
- generalMatrixProduct (apply matrix product defined by two functions to a
vector and a matrix)
- transpose (of a matrix)
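
To make this concrete, here is a rough sketch of how these qualities might be
surfaced to an optimizer.  None of these interfaces exist in Mahout today; the
names and signatures are just placeholders.

    // Hypothetical sketch only; nothing like this exists in Mahout yet.
    // It just expresses the qualities above as interrogable properties.
    interface VectorQualities {
      boolean isDense();
      double sparsity();             // fraction of non-zeros
      int skewedSparse();            // +1 head-heavy, -1 tail-heavy, 0 roughly even
      int size();
      boolean mergeUpdate();         // collect updates and merge back
      double randomAccessCost();     // estimated cost per random read
      double sequentialAccessCost(); // estimated cost per element in a sweep
      double randomUpdateCost();     // estimated cost per random write
    }

    interface MatrixQualities {
      boolean rowMajor();            // adjacent elements in rows are near in memory
      boolean columnMajor();         // adjacent elements in columns are near in memory
      boolean randomOrder();         // adjacent elements are not near in memory
      double sequentialAccessCost(); // major-order access, including caching effects
      double randomAccessCost();
      boolean mergeUpdate();
    }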

The physical operators will probably need to have a large number of
specializations of these, particularly the generalMatrixProduct functionals.

The optimizer should be able to estimate the run-time of various physical
plans at least down to ordering the different plans relatively correctly.
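
As an illustration of the granularity I have in mind, a cost estimate for one
physical plan of a matrix product might look roughly like this.  This is purely
hypothetical and builds on the quality sketch above; the constants would have
to come from measurement.

    // Hypothetical cost model for C = A * B computed as "for each row of A,
    // accumulate a[i][k] * B[k, :] into the result row".  m, k, n are the usual
    // dimensions; sparsityA is the fraction of non-zeros in A.
    final class ProductCost {
      static double rowWiseAccumulationCost(MatrixQualities a, MatrixQualities b,
                                            int m, int k, int n, double sparsityA) {
        double nonZerosPerRowOfA = sparsityA * k;
        double readA = m * nonZerosPerRowOfA * a.sequentialAccessCost();
        // every non-zero of A triggers a sweep over one row of B
        double readB = m * nonZerosPerRowOfA * n
            * (b.rowMajor() ? b.sequentialAccessCost() : b.randomAccessCost());
        double writeC = (double) m * n;   // assume dense accumulation into C
        return readA + readB + writeC;
      }
    }

The optimizer would compute such an estimate for each candidate loop order and
keep the cheapest; getting the estimates right in absolute terms matters much
less than getting the ordering right.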

*Open questions* I have include:

- Is there a really good way to represent the physical plans such that
reduction to an execution plan is relatively easy?

- Can we have a code generator for the execution plan to avoid coding
dozens to hundreds of primitive executors?

- Is this enough to get good run time estimations?

- Is there a clever way we can gather sample run-times so that we can use
machine learning to learn the cost function?
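
On the last question, even something as crude as wrapping each primitive so
that it logs a few coarse features and its wall-clock time would give us
training data to regress a cost function against.  A sketch (the wrapper and
log format are made up; only times() is existing API):

    import java.io.IOException;
    import org.apache.mahout.math.Matrix;

    // Sketch: time an existing primitive and append one CSV line per call so a
    // regression can later be fit over (features -> observed nanoseconds).
    final class TimedOps {
      static Matrix timedTimes(Matrix a, Matrix b, Appendable log) throws IOException {
        long start = System.nanoTime();
        Matrix c = a.times(b);
        long nanos = System.nanoTime() - start;
        log.append(String.format("times,%d,%d,%d,%d%n",
            a.numRows(), a.numCols(), b.numCols(), nanos));
        return c;
      }
    }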

Re: Talking about optimization

Posted by Ted Dunning <te...@gmail.com>.
Chris,

You did drop into the middle of a discussion, but it still sounds like you
dropped into the right discussion.

The context here is this:

- we actually often do have sparse-sparse operations.  These are, as you
say, channel limited rather than CPU limited

- we have interesting combinations of operations such as A'A or A'B or
elementwise operations of various sorts which can be computed in many
different ways.  Some of the matrices have a strong row-major preference,
some are transposed row-major structures and thus strongly prefer
column-major access.  Some of the matrices have good sequential access but
terrible random access; others have great random access but terrible cache
properties.

- even dense matrices have very strong access pattern preferences for
obvious reasons

- our math system is becoming a two level system.  On the upper level, we
do lazy evaluation and some simple cost based heuristic rearrangement to
improve performance.  On the lower level, the large scale operations are
set, but we get lots of mileage out of a cost based optimization that picks
the looping pattern to use.  This optimizer looks at properties of the vectors
in question, as well as of the functions being executed, and decides what it
can do.
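
For a flavor of what that lower-level choice looks like, here is a simplified
sketch of the kind of loop selection involved for an in-place vector-vector
operation.  It leans on the functional properties we already expose
(isLikeRightPlus and friends); the real dispatch is considerably more elaborate.

    import org.apache.mahout.math.Vector;
    import org.apache.mahout.math.function.DoubleDoubleFunction;

    // Simplified sketch of loop selection; not the actual Mahout dispatch code.
    final class LoopChoice {
      static void applyInPlace(Vector x, Vector y, DoubleDoubleFunction f) {
        if (f.isLikeRightPlus()) {
          // f(x, 0) == x, so positions where y is zero are untouched:
          // iterate only the non-zeros of y
          for (Vector.Element e : y.nonZeroes()) {
            x.setQuick(e.index(), f.apply(x.getQuick(e.index()), e.get()));
          }
        } else {
          // otherwise every position has to be visited
          for (int i = 0; i < x.size(); i++) {
            x.setQuick(i, f.apply(x.getQuick(i), y.getQuick(i)));
          }
        }
      }
    }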

Our current discussion is about how we can improve the large matrix
operations at the lower level.  This is prompted by the discovery that one
of our sparse matrices was doing an abysmal job of optimizing certain
products (i.e. it was devolving to full dense operation).  I put a quick
hack in place to avoid that, but it would be good to get something better.
 We would get bonus points if the physical operations that we expose at the
lower level are expressive enough to allow the higher level to do a better
job of large scale optimization.


This connects pretty closely to your comments in a few areas.  Where you
are talking about communication minimization, read our high-level optimizer.
Where you talk about cache friendliness, read our low-level optimizer.  The
big difference is that we really do care about sparse-sparse cases and
don't have a way to reduce to sparse-dense cases.

In a few cases, we might have matrices with skewed sparsity such that
they could be row- and column-permuted into a very small dense kernel,
adjoined to the right and below by relatively sparse skinny matrices, and
diagonally down and to the right by an extremely sparse matrix.  Whether or
not this really helps is an open question; we have often depended on the
near-randomness of our native arrangement to argue that we won't have much
imbalance in the parallel execution.  Many times this is even true.



So does that catch you up?







On Mon, Jun 9, 2014 at 4:45 PM, Chris Baker <cg...@gmail.com> wrote:

> I just joined the mailing lists and have apparently dropped in the middle
> of the conversation, so excuse me if I'm repeating stuff that's already
> been said and/or isn't useful.
>
> Sparse mat-vec is, in general, a bandwidth-limited operation. It has very
> low arithmetic density and is, in general, a waste of a good CPU. I'll
> assume dense operand vector, since that is the use case for most algorithms
> (and since it's what I'm most familiar with ;).
>
> The best options for improving the performance are to improve the memory
> access patterns through the operand vector (assuming SRM). One of the
> simplest ways to do that is to turn your sparse matrix into a block matrix
> and then rely on efficient local/dense matvec to make up for the zero
> elements that you undoubtedly introduced; this generally requires
> reordering the matrix to expose these blocks. If the analysis required for
> this isn't itself parallel, then you've traded your embarrassingly parallel
> (albeit computationally dense) operation for a largely serial one; you'd
> better be using it a lot of times if you want to amortize that cost.
>
> There is a whole class of communication avoiding algorithms that seek to
> avoid this off-die and off-node data transfer, for example, by applying the
> matrix in powers (exploit a little extra transfer to perform A^3, A^2 and A
> at the same time and perform three steps of, e.g., Lanczos, while paying
> for the latency of only one). However, the performance of these is still
> very dependent on the sparsity structure of the matrix, and the algorithms
> are significantly more difficult and numerically sensitive.
>
> Block algorithms (those that apply the matrix to multiple operand vectors
> at the same time) are also useful, because they allow for re-use of the
> sparsity structure. Course, that's a chicken-egg issue; there's limited
> incentive to write block algorithms without a block sparse mat-vec, and
> limited reason to write a blocked mat-vec without the indication that folks
> will write algorithms.
>
>
> On Mon, Jun 9, 2014 at 7:20 PM, Dmitriy Lyubimov <dl...@gmail.com>
> wrote:
>
> > no argument against it, except who's going to do e2e local math optimizer
> > and how long it will take. maybe, both is possible -- faster local gains
> we
> > know current algorithms need, and longer term overhaul that included e2e
> > in-core plans.
> >
> >
> > On Mon, Jun 9, 2014 at 4:16 PM, Ted Dunning <te...@gmail.com>
> wrote:
> >
> > > On Mon, Jun 9, 2014 at 4:12 PM, Dmitriy Lyubimov <dl...@gmail.com>
> > > wrote:
> > >
> > > > One open question i had was,well, since the outlined approach avoids
> > > > logical plans still for in-core matrix, it would be difficult to
> > optimize
> > > > the expressions in this context (i.e. figure out best-working
> _output_
> > > > structure).
> > > >
> > >
> > > This is the single biggest flaw that I see.  To do higher level
> > > optimizations, the lower levels have to expose costs for possible
> plans.
> > >
> > > Taking a major whack at the local cases for element-wise and matrix
> > > multiply is still a good thing to do.
> > >
> >
>

Re: Talking about optimization

Posted by Chris Baker <cg...@gmail.com>.
I just joined the mailing lists and have apparently dropped in the middle
of the conversation, so excuse me if I'm repeating stuff that's already
been said and/or isn't useful.

Sparse mat-vec is, in general, a bandwidth-limited operation. It has very
low arithmetic density and is, in general, a waste of a good CPU. I'll
assume dense operand vector, since that is the use case for most algorithms
(and since it's what I'm most familiar with ;).

The best options for improving the performance are to improve the memory
access patterns through the operand vector (assuming SRM). One of the
simplest ways to do that is to turn your sparse matrix into a block matrix
and then rely on efficient local/dense matvec to make up for the zero
elements that you undoubtedly introduced; this generally requires
reordering the matrix to expose these blocks. If the analysis required for
this isn't itself parallel, then you've traded your embarrassingly parallel
(albeit computationally dense) operation for a largely serial one; you'd
better be using it a lot of times if you want to amortize that cost.
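
For illustration, a toy version of that blocked layout might look like the
following.  This is just a sketch of the idea, not tuned code; the point is
that the inner loop reads the operand vector contiguously and tolerates the
explicit zeros inside each tile.

    import java.util.List;

    // Toy block-sparse mat-vec: the matrix is a list of small dense tiles, and
    // y += A * x is computed tile by tile with dense inner loops.
    final class BlockSparseMatVec {
      static final class Block {
        final int rowOffset;      // top-left corner of this tile
        final int colOffset;
        final double[][] values;  // dense tile, zeros and all
        Block(int rowOffset, int colOffset, double[][] values) {
          this.rowOffset = rowOffset;
          this.colOffset = colOffset;
          this.values = values;
        }
      }

      static void times(List<Block> blocks, double[] x, double[] y) {
        for (Block blk : blocks) {
          for (int i = 0; i < blk.values.length; i++) {
            double sum = 0;
            for (int j = 0; j < blk.values[i].length; j++) {
              sum += blk.values[i][j] * x[blk.colOffset + j];  // contiguous reads of x
            }
            y[blk.rowOffset + i] += sum;
          }
        }
      }
    }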

There is a whole class of communication avoiding algorithms that seek to
avoid this off-die and off-node data transfer, for example, by applying the
matrix in powers (exploit a little extra transfer to perform A^3, A^2 and A
at the same time and perform three steps of, e.g., Lanczos, while paying
for the latency of only one). However, the performance of these is still
very dependent on the sparsity structure of the matrix, and the algorithms
are significantly more difficult and numerically sensitive.

Block algorithms (those that apply the matrix to multiple operand vectors
at the same time) are also useful, because they allow for re-use of the
sparsity structure. Course, that's a chicken-egg issue; there's limited
incentive to write block algorithms without a block sparse mat-vec, and
limited reason to write a blocked mat-vec without the indication that folks
will write algorithms.
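
To show the payoff concretely, here is a sketch for a CSR-style layout: the
sparsity structure is traversed once and every loaded non-zero is reused
across all k operand vectors.  (Again, just an illustration, not proposed code.)

    // ys[v] += A * xs[v] for all k operand vectors in a single sweep of A.
    // rowStarts/colIndices/values are the usual CSR arrays.
    final class BlockedMatVec {
      static void times(int[] rowStarts, int[] colIndices, double[] values,
                        double[][] xs, double[][] ys) {
        int k = xs.length;                 // number of operand vectors
        for (int row = 0; row < rowStarts.length - 1; row++) {
          for (int p = rowStarts[row]; p < rowStarts[row + 1]; p++) {
            int col = colIndices[p];
            double a = values[p];          // loaded once, used k times
            for (int v = 0; v < k; v++) {
              ys[v][row] += a * xs[v][col];
            }
          }
        }
      }
    }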


On Mon, Jun 9, 2014 at 7:20 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> no argument against it, except who's going to do e2e local math optimizer
> and how long it will take. maybe, both is possible -- faster local gains we
> know current algorithms need, and longer term overhaul that included e2e
> in-core plans.
>
>
> On Mon, Jun 9, 2014 at 4:16 PM, Ted Dunning <te...@gmail.com> wrote:
>
> > On Mon, Jun 9, 2014 at 4:12 PM, Dmitriy Lyubimov <dl...@gmail.com>
> > wrote:
> >
> > > One open question i had was,well, since the outlined approach avoids
> > > logical plans still for in-core matrix, it would be difficult to
> optimize
> > > the expressions in this context (i.e. figure out best-working _output_
> > > structure).
> > >
> >
> > This is the single biggest flaw that I see.  To do higher level
> > optimizations, the lower levels have to expose costs for possible plans.
> >
> > Taking a major whack at the local cases for element-wise and matrix
> > multiply is still a good thing to do.
> >
>

Re: Talking about optimization

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
no argument against it, except who's going to do an e2e local math optimizer
and how long it will take. Maybe both are possible -- faster local gains that
we know current algorithms need, and a longer-term overhaul that includes e2e
in-core plans.


On Mon, Jun 9, 2014 at 4:16 PM, Ted Dunning <te...@gmail.com> wrote:

> On Mon, Jun 9, 2014 at 4:12 PM, Dmitriy Lyubimov <dl...@gmail.com>
> wrote:
>
> > One open question i had was,well, since the outlined approach avoids
> > logical plans still for in-core matrix, it would be difficult to optimize
> > the expressions in this context (i.e. figure out best-working _output_
> > structure).
> >
>
> This is the single biggest flaw that I see.  To do higher level
> optimizations, the lower levels have to expose costs for possible plans.
>
> Taking a major whack at the local cases for element-wise and matrix
> multiply is still a good thing to do.
>

Re: Talking about optimization

Posted by Ted Dunning <te...@gmail.com>.
On Mon, Jun 9, 2014 at 4:12 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> One open question i had was,well, since the outlined approach avoids
> logical plans still for in-core matrix, it would be difficult to optimize
> the expressions in this context (i.e. figure out best-working _output_
> structure).
>

This is the single biggest flaw that I see.  To do higher level
optimizations, the lower levels have to expose costs for possible plans.

Taking a major whack at the local cases for element-wise and matrix
multiply is still a good thing to do.

Re: Talking about optimization

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
I was more in favor of a more pragmatic approach: fix as you go, rather
than do all cases at once.

Ted's last patch is an absolutely fine beginning of that, in that it
introduces 3 separate algorithms for the various cases where one operand is
always an SRM (or, symmetrically, an SCM) and the other operand is one of
dense, sparse, or SRM/SCM.

What it lacks, however, is a way to connect those 3 algorithms to permutations
of all of the above and their transposed views.

I was thinking about a simple patch along the lines of adding the directional
primitives isRowWise and isColumnWise on top of what we already have (so a
transposed view would just flip those responses), and then building a single
times algorithm that covers pretty much... everything. I am not even sure
that we need cost interrogations, since once we figure out proper
directionality, we'd be able either to rely on the vector-vector stuff to
sort out the cost-based algorithms or, whenever we cannot operate with
vectors directly, to make certain cost assumptions based on previous methods
as well (e.g. if it is neither row-wise nor column-wise and not dense, then
it is ok to assume open hash costs).
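
A sketch of what I mean is below.  isRowWise/isColumnWise don't exist yet, so
they show up here as plain booleans; the rest uses only existing vector/matrix
calls, and the result is assumed to start out zeroed.

    import org.apache.mahout.math.Matrix;
    import org.apache.mahout.math.Vector;
    import org.apache.mahout.math.function.Functions;

    // Sketch of a single times() driven by directionality rather than concrete types.
    final class DirectionalTimes {
      static Matrix times(Matrix a, Matrix b, Matrix result,
                          boolean aIsRowWise, boolean bIsColumnWise) {
        if (aIsRowWise && bIsColumnWise) {
          // rows of A against columns of B: both sides iterate in their cheap direction
          for (int i = 0; i < a.numRows(); i++) {
            Vector ai = a.viewRow(i);
            for (int j = 0; j < b.numCols(); j++) {
              result.setQuick(i, j, ai.dot(b.viewColumn(j)));
            }
          }
        } else {
          // accumulate scaled rows of B; never touches B column-wise, which makes
          // it the safer default when we can't establish good directionality
          for (int i = 0; i < a.numRows(); i++) {
            Vector resultRow = result.viewRow(i);
            for (Vector.Element e : a.viewRow(i).nonZeroes()) {
              resultRow.assign(b.viewRow(e.index()), Functions.plusMult(e.get()));
            }
          }
        }
        return result;
      }
    }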

so this is for times only. I think a tiny bit of thinking while
implementing will prove that the elementwise operations can also make
similar cost assumptions.

I am just afraid of work _too big_ to reasonably succeed. And soon.

One open question I had was, well: since the outlined approach still avoids
logical plans for in-core matrices, it would be difficult to optimize
the expressions in this context (i.e. figure out the best-working _output_
structure).





On Mon, Jun 9, 2014 at 3:39 PM, Ted Dunning <te...@gmail.com> wrote:

> So D's disparagement of my half-assed fix for our egregious
> mis-performance has led me to thinking about what a real optimizer would
> do.
>
> I think we all agree that we are attempting to maximize performance for a
> flow of operations given pre-existing layouts.  Transformations include
> combining and rearranging operations and using sparse iterations where
> applicable.  I expect that optimizing for intermediate memory use will
> happen as a side effect.
>
> As I see it, we have some properties of the matrices and vectors and
> functions that are our inputs that we could use.  We also have a relatively
> small set of higher order functions that include the primitives that we
> already have and a few extras.
>
> These include at least
>
> *Vector qualities*
>
> - isDense
> - sparsity (fraction of non-zeros)
> - skewedSparse (+1 if the non-zeros are near the beginning of a vector, -1
> if near the end, 0 if roughly symmetric)
> - size
> - mergeUpdate (true if should update by collecting updates and merging
> back)
> - randomAccessCost
> - sequentialAccessCost
> - randomUpdateCost
>
> *Matrix qualities*
>
> - rowMajor (true if adjacent elements in rows are near in memory)
> - columnMajor (true if adjacent elements in columns are near in memory)
> - randomOrder (true if adjacent elements are not near in memory)
> - sequentialAccessCost (estimated cost of access in major order including
> some estimate of caching effects)
> - randomAccessCost
> - mergeUpdate
>
> *Functional properties*
>
> (all as supported in Mahout now)
>
> *Primitive logical operators*
>
> (many of these are already supported in Mahout.  Some additions would be
> required.)
>
> - applyElementWise (apply a unary function to a matrix or vector)
> - applyElementWise (apply a binary function to two matrices or two vectors)
> - applyElementsToRows (apply a binary function to elements of a vector and
> corresponding rows of a matrix)
> - applyElementsToColumns (apply a binary function to elements of a vector
> and corresponding columns of a matrix)
> - outerProduct (apply a binary function to all combinations of elements of
> two vectors)
> - applyRowWise (apply a vector function per row to get rows of a result
> matrix)
> - applyRowWise (apply a vector function per row to get values of a result
> vector)
> - applyRowWise (apply a vector function per row to get a list of results)
> - applyColumnWise (apply a vector function per column to get columns of a
> result matrix)
> - applyColumnWise (apply a vector function per column to get values of a
> result vector)
> - applyColumnWise (apply a vector function per column to get a list of
> results)
> - generalMatrixProduct (apply matrix product defined by two functions to
> two matrices)
> - generalMatrixProduct (apply matrix product defined by two functions to a
> matrix and a vector)
> - generalMatrixProduct (apply matrix product defined by two functions to a
> vector and a matrix)
> - transpose (of a matrix)
>
> The physical operators will probably need to have a large number of
> specializations of these, particularly the generalMatrixProduct
> functionals.
>
> The optimizer should be able to estimate the run-time of various physical
> plans at least down to ordering the different plans relatively correctly.
>
> *Open questions* I have include:
>
> - Is there a really good way to represent the physical plans such that
> reduction to an execution plan is relatively easy?
>
> - Can we have a code generator for the execution plan to avoid coding
> dozens to hundreds of primitive executors?
>
> - Is this enough to get good run time estimations?
>
> - Is there a clever way we can gather sample run-times so that we can use
> machine learning to learn the cost function?
>