You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@mahout.apache.org by Grant Ingersoll <gs...@apache.org> on 2009/06/23 21:36:31 UTC

MAHOUT-77 speedups

I'm doing some more profiling and it seems like all over the place we  
are looping from i = 0; i < size() and calling getQuick.  This is  
then, for SparseVector, dominated by the find lookup, which is taking  
up a lot of time.

So, it's not that find() is slow, but the fact that we are needlessly  
calling getQuick() for loops.

Just an FYI as we go forward not to do that.

-Grant

Re: MAHOUT-77 speedups

Posted by Grant Ingersoll <gs...@apache.org>.



On Jun 23, 2009, at 7:12 PM, Ted Dunning wrote:

> I would say that the iterator that makes sense for all vectors  
> should be
> default, if there is any default.
>
> Option 1: no default, user must specify which they want (avoids
> assumptions), Vector is no longer an iterable, but has two methods  
> that
> return iterables.

+1.  I'll work this up as part of M-77 patch I have going.

Re: MAHOUT-77 speedups

Posted by Ted Dunning <te...@gmail.com>.

I would say that the iterator that makes sense for all vectors should be
default, if there is any default.

Option 1: no default, user must specify which they want (avoids
assumptions), Vector is no longer an iterable, but has two methods that
return iterables.

Option 2: default is all elements, user can select non-zero iterator

Option 3: default is non-zero, user can select all element iterator

I think that option 2 would surprise users of sparse vectors and 3 would
surprise dense vector users.  We have proof of the surprise in (2).

My vote would be (1) because it is clear what is happening.

On Tue, Jun 23, 2009 at 4:07 PM, Sean Owen <sr...@gmail.com> wrote:

> Sounds good. Which should be the default (returned from iterator())?
> non-zero?
>
>
> On Tue, Jun 23, 2009 at 6:53 PM, Ted Dunning<te...@gmail.com> wrote:
> > I disagree.
> >
> > There are two common use cases: iterateAll and iterateNonZero
> >
> > One example of where iterateAll is different is where I want to get sum_i
> > |x_i - z|  (as in L_1 clustering).  I can't iterate over just non-zero
> > elements of x_i unless z is zero (which it is not, in general).
>

-- 
Ted Dunning, CTO
DeepDyve

111 West Evelyn Ave. Ste. 202
Sunnyvale, CA 94086
http://www.deepdyve.com
858-414-0013 (m)
408-773-0220 (fax)

Re: MAHOUT-77 speedups

Posted by Sean Owen <sr...@gmail.com>.

Sounds good. Which should be the default (returned from iterator())? non-zero?


On Tue, Jun 23, 2009 at 6:53 PM, Ted Dunning<te...@gmail.com> wrote:
> I disagree.
>
> There are two common use cases: iterateAll and iterateNonZero
>
> One example of where iterateAll is different is where I want to get sum_i
> |x_i - z|  (as in L_1 clustering).  I can't iterate over just non-zero
> elements of x_i unless z is zero (which it is not, in general).

Re: MAHOUT-77 speedups

Posted by Ted Dunning <te...@gmail.com>.

I disagree.

There are two common use cases: iterateAll and iterateNonZero

One example of where iterateAll is different is where I want to get sum_i
|x_i - z|  (as in L_1 clustering).  I can't iterate over just non-zero
elements of x_i unless z is zero (which it is not, in general).

On Tue, Jun 23, 2009 at 3:23 PM, Sean Owen <sr...@gmail.com> wrote:

> Until use cases demonstrate otherwise, I suggest we pick one contract
> and implement it rather than several. I think the most reasonable use
> case, given what I see so far, is "non zero elements". What say?
>
> On Tue, Jun 23, 2009 at 6:21 PM, Ted Dunning<te...@gmail.com> wrote:
> > Should there be a method that returns different sets of elements?
> > (allElements(), nonZeroElements(), filteredElements(Condition)).
>  Whatever
> > the contract, it should definitely be stated and should definitely work
> the
> > same for sparse or non-sparse matrices.
>

Re: MAHOUT-77 speedups

Posted by Sean Owen <sr...@gmail.com>.

Until use cases demonstrate otherwise, I suggest we pick one contract
and implement it rather than several. I think the most reasonable use
case, given what I see so far, is "non zero elements". What say?

On Tue, Jun 23, 2009 at 6:21 PM, Ted Dunning<te...@gmail.com> wrote:
> Should there be a method that returns different sets of elements?
> (allElements(), nonZeroElements(), filteredElements(Condition)).  Whatever
> the contract, it should definitely be stated and should definitely work the
> same for sparse or non-sparse matrices.

Re: MAHOUT-77 speedups

Posted by Ted Dunning <te...@gmail.com>.

Should there be a method that returns different sets of elements?
(allElements(), nonZeroElements(), filteredElements(Condition)).  Whatever
the contract, it should definitely be stated and should definitely work the
same for sparse or non-sparse matrices.

The only two reasonable options that I see are:

a) iteration through vector themselves is in-order, hitting all elements.  A
special method returns an iterator through non-zeros or through all elements
not equal to some specified value or satisfying some condition.  Iterating
through the non-zeros should work identically for dense and sparse matrices.

b) iteration through a vector is "something" and is different for all types,
but generally is something that we think would be convenient.  Special
methods return iterators with defined semantics.

Obviously from the sarcastic tone of (b), I prefer (a)

In-order iteration is definitely hard to do with many implementations.  I
don't know of any algorithms that really require in-order traversal for
sparse matrices/vectors.  For dense algorithms, ordering is often important
(imagine back-substitution in an LU solver).

On Tue, Jun 23, 2009 at 2:10 PM, Sean Owen <sr...@gmail.com> wrote:

> You bet, Vector implements Iterable<Vector.Element>!
>
> In the case of SparseVector it only hits the nonzero elements.
>
> I wonder if we can tighten up the contract of Vector in this respect:
> - can it really not guarantee in-order iteration? this seems intuitive
> and maybe useful
> - should state it may not (will not?) hit elements that are zero?
>
>

Re: MAHOUT-77 speedups

Posted by Grant Ingersoll <gs...@apache.org>.

Do we have to do new Element() or could we have a contract whereby we  
document that users must clone the element first?  that way we could  
have one element per iterator that get's reused.


On Jun 23, 2009, at 5:40 PM, Grant Ingersoll wrote:

> Seems like we should use the Iterator in most places and focus on  
> making it fast.  It is likely that we could make some optimizations  
> to the Iterator to take better advantage of the underlying  
> structures, instead of relying on the Abstract version.
>
>
> On Jun 23, 2009, at 5:10 PM, Sean Owen wrote:
>
>> You bet, Vector implements Iterable<Vector.Element>!
>>
>> In the case of SparseVector it only hits the nonzero elements.
>>
>> I wonder if we can tighten up the contract of Vector in this respect:
>> - can it really not guarantee in-order iteration? this seems  
>> intuitive
>> and maybe useful
>> - should state it may not (will not?) hit elements that are zero?
>>
>> On Tue, Jun 23, 2009 at 5:00 PM, Ted Dunning<te...@gmail.com>  
>> wrote:
>>> Do we have a good way to iterate over non-zero elements?
>>>
>>> On Tue, Jun 23, 2009 at 12:36 PM, Grant Ingersoll <gsingers@apache.org 
>>> >wrote:
>>>
>>>> I'm doing some more profiling and it seems like all over the  
>>>> place we are
>>>> looping from i = 0; i < size() and calling getQuick.  This is  
>>>> then, for
>>>> SparseVector, dominated by the find lookup, which is taking up a  
>>>> lot of
>>>> time.
>>>>
>>>> So, it's not that find() is slow, but the fact that we are  
>>>> needlessly
>>>> calling getQuick() for loops.
>>>>
>>>> Just an FYI as we go forward not to do that.
>>>>
>>>>
>>>
>
>

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:
http://www.lucidimagination.com/search

Re: MAHOUT-77 speedups

Posted by Grant Ingersoll <gs...@apache.org>.

Seems like we should use the Iterator in most places and focus on  
making it fast.  It is likely that we could make some optimizations to  
the Iterator to take better advantage of the underlying structures,  
instead of relying on the Abstract version.


On Jun 23, 2009, at 5:10 PM, Sean Owen wrote:

> You bet, Vector implements Iterable<Vector.Element>!
>
> In the case of SparseVector it only hits the nonzero elements.
>
> I wonder if we can tighten up the contract of Vector in this respect:
> - can it really not guarantee in-order iteration? this seems intuitive
> and maybe useful
> - should state it may not (will not?) hit elements that are zero?
>
> On Tue, Jun 23, 2009 at 5:00 PM, Ted Dunning<te...@gmail.com>  
> wrote:
>> Do we have a good way to iterate over non-zero elements?
>>
>> On Tue, Jun 23, 2009 at 12:36 PM, Grant Ingersoll <gsingers@apache.org 
>> >wrote:
>>
>>> I'm doing some more profiling and it seems like all over the place  
>>> we are
>>> looping from i = 0; i < size() and calling getQuick.  This is  
>>> then, for
>>> SparseVector, dominated by the find lookup, which is taking up a  
>>> lot of
>>> time.
>>>
>>> So, it's not that find() is slow, but the fact that we are  
>>> needlessly
>>> calling getQuick() for loops.
>>>
>>> Just an FYI as we go forward not to do that.
>>>
>>>
>>

Re: MAHOUT-77 speedups

Posted by Sean Owen <sr...@gmail.com>.

You bet, Vector implements Iterable<Vector.Element>!

In the case of SparseVector it only hits the nonzero elements.

I wonder if we can tighten up the contract of Vector in this respect:
- can it really not guarantee in-order iteration? this seems intuitive
and maybe useful
- should state it may not (will not?) hit elements that are zero?

On Tue, Jun 23, 2009 at 5:00 PM, Ted Dunning<te...@gmail.com> wrote:
> Do we have a good way to iterate over non-zero elements?
>
> On Tue, Jun 23, 2009 at 12:36 PM, Grant Ingersoll <gs...@apache.org>wrote:
>
>> I'm doing some more profiling and it seems like all over the place we are
>> looping from i = 0; i < size() and calling getQuick.  This is then, for
>> SparseVector, dominated by the find lookup, which is taking up a lot of
>> time.
>>
>> So, it's not that find() is slow, but the fact that we are needlessly
>> calling getQuick() for loops.
>>
>> Just an FYI as we go forward not to do that.
>>
>>
>

Re: MAHOUT-77 speedups

Posted by Ted Dunning <te...@gmail.com>.

Do we have a good way to iterate over non-zero elements?

On Tue, Jun 23, 2009 at 12:36 PM, Grant Ingersoll <gs...@apache.org>wrote:

> I'm doing some more profiling and it seems like all over the place we are
> looping from i = 0; i < size() and calling getQuick.  This is then, for
> SparseVector, dominated by the find lookup, which is taking up a lot of
> time.
>
> So, it's not that find() is slow, but the fact that we are needlessly
> calling getQuick() for loops.
>
> Just an FYI as we go forward not to do that.
>
>