You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@mahout.apache.org by Sean Owen <sr...@gmail.com> on 2010/04/15 16:16:52 UTC

Having some trouble with SequentialAccessSparseVector.DenseVector

Along the way to a patch for MAHOUT-379, I'm having some trouble
figuring out SequentialAccessSparseVector.DenseVector. I think it can
be simplified, but unless I'm misunderstanding there are several bugs
here. I'd like to find my mistake or else simplify/fix this along the
way.

get() uses offset and index, both initialized to the 'ind' argument,
which is an index into the vector, not an offset into the mapping
array. But then "cur = indices[index]" is compared against offset,
which seems incorrect.

On a related note in the Iterator's next() method we have

      element.offset = offset++;

But this makes the two values forever inequal by one, which doesn't
sound right? they start off equal at 0.

I could be mistaken, but I also wonder whether this can be
dramatically simplified by a single Element implementation which both
understands its logical index into the vector, and offset into the
mapping array if applicable.

Re: Having some trouble with SequentialAccessSparseVector.DenseVector

Posted by Sean Owen <sr...@gmail.com>.
Actually it does all work. I wrote some tests that verify it. I think
my first question about index and cur works out because both are set
to 0 -- and 0 is correct as the starting value of an array offset and
index. And in the other case I believe it's intended that the two
values are the current and next values, respectively.

Nevertheless let me post a patch that rearranges these
Iterator/Element implementations in a way that is perhaps more
consistent across the three classes in naming and structure. I think
there are a few opportunities for streamlining too -- for example
these static inner classes are taking a reference to the enclosing
class, when making them non-static gets you that for free. (There's
still a reference there, just synthetic.)

Re: Having some trouble with SequentialAccessSparseVector.DenseVector

Posted by Jake Mannix <ja...@gmail.com>.
Hey Sean,


On Thu, Apr 15, 2010 at 7:16 AM, Sean Owen <sr...@gmail.com> wrote:

> Along the way to a patch for MAHOUT-379, I'm having some trouble
> figuring out SequentialAccessSparseVector.DenseVector. I think it can
> be simplified, but unless I'm misunderstanding there are several bugs
> here. I'd like to find my mistake or else simplify/fix this along the
> way.
>

  You mean SequentialAccessSparseVector.DenseElement, right?

  I apologize for some of those iterator element crazy loops.  They may
be simplifyable.  I'd be surprised if the logic is really *wrong*, because
there are too many places where we'd break if so.  Although, we very
rarely iterateAll() on sparse vectors, so I guess it's possible this has
bugs in it still.

get() uses offset and index, both initialized to the 'ind' argument,
> which is an index into the vector, not an offset into the mapping
> array. But then "cur = indices[index]" is compared against offset,
> which seems incorrect.
>

Oof.  That code is hard to read.  I'll have to look at it closer and
see what the hell I was thinking.


> On a related note in the Iterator's next() method we have
>
>      element.offset = offset++;
>
> But this makes the two values forever inequal by one, which doesn't
> sound right? they start off equal at 0.
>

That seems pretty clear.  If that line can at all be correct it would
have to be ++offset, not offset++.


> I could be mistaken, but I also wonder whether this can be
> dramatically simplified by a single Element implementation which both
> understands its logical index into the vector, and offset into the
> mapping array if applicable.
>

If you can dramatically simplify it in such a way that it works with
both dense and sparse iteration, and can handle the ugly case
of Element.set(double) (which completely changes the index and
value arrays by possibly shifting them over one), by all means.

Is there a simple unit test we can add to see if it's truly broken
currently?  It seems like it should be easy to see, if we
exercise dense iteration on SequentialAccessSparseVector impls,
possibly ones which have a nonzero value for get(0).

Speaking of which, I notice that indeed, we don't ever mess with
iterateAll().  A really simple unit test should show that is totally
broken.

Hmm... I tried changing TestSparseVector.testIterator() to test
SeqAcc instead of RandAcc, and it still passed, so it's not
a strict enough test, I guess...

  -jake