Posted to dev@mahout.apache.org by Dan Filimon <da...@gmail.com> on 2013/04/10 23:56:14 UTC

Horrible performance with SequentialAccessSparseVector

Good news and bad news!

After weeks of searching, I finally found the bug that made my code run so
slowly.
I'm using SequenceFileDirIterable to read the centroids from disk and it
looks like it's creating SequentialAccessSparseVectors.

Then, when assigning each point to its appropriate centroid, we update the
centroid itself using assign() and a custom function that does a weighted
sum.
Herein lies the problem.
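
The update looks roughly like this (a sketch with made-up names, not the
exact code):

import org.apache.mahout.math.Vector;
import org.apache.mahout.math.function.DoubleDoubleFunction;

// Sketch (illustrative names): fold a point into a centroid as a running
// weighted mean. 'centroid' ends up being a SequentialAccessSparseVector
// here, which is exactly the problem.
static void updateCentroid(Vector centroid, Vector point, long numPoints) {
  final double w = numPoints / (numPoints + 1.0);
  centroid.assign(point, new DoubleDoubleFunction() {
    @Override
    public double apply(double c, double p) {
      return w * c + (1.0 - w) * p;  // weighted sum of old and new values
    }
  });
}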

In the existing code, assign() comes from AbstractVector and if the
function is not PLUS or PLUS_ABS, it does this:

for (int i = 0; i < size; i++) {
  // dense loop: touches every index, even when both vectors are sparse
  setQuick(i, function.apply(getQuick(i), other.getQuick(i)));
}

If "this" is a SequentialAccessSparseVector, getting is done in O(log size)
steps as is setting (except when a new element needs to be inserted which
can cause the two arrays in OrderedIntDoubleMapping to grow, which means a
copy of the vector has to be made).

Even assuming copying is not an issue (although it really is: the centroids
grow denser as more points are added), that is O(size log size) for a single
assignment, where size is the dimension of the vector (NOT the number of
nonzero mappings).

In my case, with 200K-dimensional vectors of which at most 100 entries are
nonzero, this was VERY painful.
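
To make the cost concrete, here is a rough timing sketch (illustrative, not
from the actual code); plusMult is a weighted sum, so it is not PLUS/PLUS_ABS
and goes through the dense loop above:

import java.util.Random;

import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.SequentialAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.function.Functions;

public class AssignTiming {
  public static void main(String[] args) {
    int dim = 200000;
    Random rand = new Random(42);
    Vector point = new SequentialAccessSparseVector(dim);
    for (int i = 0; i < 100; i++) {
      point.set(rand.nextInt(dim), rand.nextDouble());  // ~100 nonzeros
    }
    Vector sparse = new SequentialAccessSparseVector(dim);
    Vector dense = new DenseVector(dim);

    long t0 = System.nanoTime();
    sparse.assign(point, Functions.plusMult(0.5));  // 200K binary-searched get/sets
    long t1 = System.nanoTime();
    dense.assign(point, Functions.plusMult(0.5));   // 200K O(1) get/sets
    long t2 = System.nanoTime();
    System.out.printf("SASV: %d us, dense: %d us%n",
        (t1 - t0) / 1000, (t2 - t1) / 1000);
  }
}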

Anyway, here's the link to an experimental patch [1] (it adds NO more
tests/benchmarks).
And here's the link to the issue [2].

What should we do?

[1] https://reviews.apache.org/r/10409/
[2] https://issues.apache.org/jira/browse/MAHOUT-1190#comment-13628314

Re: Horrible performance with SequentialAccessSparseVector

Posted by Ted Dunning <te...@gmail.com>.
Which code is this in?

If you are reading centroids from a file, it seems that you are just
classifying points, which should not involve any updates.

If you are reading points from a file and creating/updating centroids, then
the centroids should be created as dense vectors.
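
E.g. (a sketch; 'loadedCentroid' stands in for whatever your reader returns):

import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Vector;

// Copy the loaded (possibly sparse) centroid into a dense vector before
// mutating it; DenseVector get/set are O(1), so the dense assign() loop
// becomes cheap.
Vector centroid = new DenseVector(loadedCentroid);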

Either way, I don't see how you hit this problem.  Clearly you do,
however, so my understanding of the goal is probably missing something.


On Wed, Apr 10, 2013 at 2:56 PM, Dan Filimon <da...@gmail.com> wrote:

> Then, when assigning each point to its appropriate centroid, we update the
> centroid itself using assign() and a custom function that does a weighted
> sum.
>

Re: Horrible performance with SequentialAccessSparseVector

Posted by Jake Mannix <ja...@gmail.com>.
SequentialAccessSparseVector should really *never* be used in a
mutating way.  Use RandomAccessSparseVector if you're going to
mutate, and then *freeze* the result in a SASV when you're done
mutating it and expect to use it only for dot() and other read-only
operations that scan over the whole set of nonzero entries.

This should be documented better, perhaps.
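
I.e., roughly (a sketch):

import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.SequentialAccessSparseVector;
import org.apache.mahout.math.Vector;

// Mutate a hash-backed RandomAccessSparseVector: O(1) expected get/set.
Vector building = new RandomAccessSparseVector(200000);
// ... all your setQuick()/assign() updates happen here ...
// Then freeze it into a SASV for fast sequential, read-only access.
Vector frozen = new SequentialAccessSparseVector(building);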


On Wed, Apr 10, 2013 at 2:56 PM, Dan Filimon <da...@gmail.com> wrote:

> Then, when assigning each point to its appropriate centroid, we update the
> centroid itself using assign() and a custom function that does a weighted
> sum.
> Herein lies the problem.
>



-- 

  -jake

Re: Horrible performance with SequentialAccessSparseVector

Posted by Jake Mannix <ja...@gmail.com>.
> In the existing code, assign() comes from AbstractVector and if the
> function is not PLUS or PLUS_ABS, it does this:
>
> for (int i = 0; i < size; i++) {
>   // dense loop: touches every index, even when both vectors are sparse
>   setQuick(i, function.apply(getQuick(i), other.getQuick(i)));
> }

Yeah, this has been a nasty, nasty fact forever.  I should read your
patch more carefully, but it looks like you're doing the "if the function
preserves zero, apply it sparsely" trick, which, if done right, is great.
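
I.e., the idea is something like this (a sketch of the trick, not what your
patch actually does):

import java.util.Iterator;

import org.apache.mahout.math.Vector;
import org.apache.mahout.math.function.DoubleDoubleFunction;

// Sketch: when f(x, 0) == x for all x, entries where 'other' is zero are
// left unchanged, so it suffices to visit only other's nonzero entries --
// ~100 operations instead of 200K in Dan's case.
static void sparseAssign(Vector self, Vector other, DoubleDoubleFunction f) {
  Iterator<Vector.Element> it = other.iterateNonZero();
  while (it.hasNext()) {
    Vector.Element e = it.next();
    self.setQuick(e.index(), f.apply(self.getQuick(e.index()), e.get()));
  }
}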







-- 

  -jake