Posted to dev@mahout.apache.org by Dan Filimon <da...@gmail.com> on 2013/04/12 18:36:59 UTC

Odd vector iteration behavior

While looking at the patch for fixing the sparse vectors (MAHOUT-1190), I
started working with vector Iterators doing what I thought was reasonable.

This is the important snippet:
[...]
        thisIterator = this.iterateNonZero();
        thatIterator = other.iterateNonZero();
        thisElement = thatElement = null;
        boolean advanceThis = true;
        boolean advanceThat = true;
        OrderedIntDoubleMapping thisUpdates = new OrderedIntDoubleMapping();

        while (thisIterator.hasNext() && thatIterator.hasNext()) {
          if (advanceThis) {
            thisElement = thisIterator.next();
          }
          if (advanceThat) {
            thatElement = thatIterator.next();
          }
[... advanceThis and advanceThat are set to true based on which iterator to
advance...]

The problem here is that calling next() invalidates the iterator state, so
the following call to hasNext() advances the iterator. Because the element
object is reused, the references already handed out now point to the next
element (the object has been mutated).

So, if the indices start at 52 and 87 and I only want to advance the
iterator at 52, both elements still end up modified, since both were
accessed with next().
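
One way to sidestep the problem on the caller side is to copy the index and value into locals right after each next(), so nothing depends on the element object once hasNext() has been called again. The following is only a sketch under assumptions: MergeElement, MergeLookAheadIterator and SparseDotSketch are hypothetical stand-ins (not Mahout classes), the iterator imitates the look-ahead-with-reuse behavior described above, and a sparse dot product is used as the example merge.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical stand-in for a reused Vector.Element; not Mahout's class.
final class MergeElement {
  int index;
  double value;
}

// Hypothetical look-ahead iterator: hasNext() overwrites the one shared element.
final class MergeLookAheadIterator implements Iterator<MergeElement> {
  private final int[] indices;
  private final double[] values;
  private final MergeElement shared = new MergeElement();
  private int position = 0;      // next slot to load
  private boolean ready = false; // true while 'shared' holds an unconsumed element

  MergeLookAheadIterator(int[] indices, double[] values) {
    this.indices = indices;
    this.values = values;
  }

  @Override
  public boolean hasNext() {
    if (ready) {
      return true;
    }
    if (position >= indices.length) {
      return false;
    }
    shared.index = indices[position]; // side effect visible to the caller
    shared.value = values[position];
    position++;
    ready = true;
    return true;
  }

  @Override
  public MergeElement next() {
    if (!hasNext()) {
      throw new NoSuchElementException();
    }
    ready = false;
    return shared;
  }
}

final class SparseDotSketch {
  // Dot product of two index-sorted sparse sequences via a two-pointer merge.
  // Only primitive copies (ia, va, ib, vb) survive across hasNext() calls.
  static double dot(Iterator<MergeElement> a, Iterator<MergeElement> b) {
    if (!a.hasNext() || !b.hasNext()) {
      return 0.0;
    }
    MergeElement e = a.next();
    int ia = e.index;
    double va = e.value;
    e = b.next();
    int ib = e.index;
    double vb = e.value;
    double sum = 0.0;
    while (true) {
      boolean advanceA = ia <= ib;
      boolean advanceB = ib <= ia;
      if (advanceA && advanceB) {
        sum += va * vb; // indices match
      }
      if (advanceA) {
        if (!a.hasNext()) {
          break;
        }
        e = a.next();
        ia = e.index;
        va = e.value;
      }
      if (advanceB) {
        if (!b.hasNext()) {
          break;
        }
        e = b.next();
        ib = e.index;
        vb = e.value;
      }
    }
    return sum;
  }
}
```

The check for hasNext() happens only on the side that is about to advance, which also avoids dropping the last element of the longer sequence.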

Here's another snippet with this behavior [1]:

    Vector vector = new SequentialAccessSparseVector(100);
    vector.set(0, 1);
    vector.set(2, 2);
    vector.set(4, 3);
    vector.set(6, 4);
    Iterator<Vector.Element> vectorIterator = vector.iterateNonZero();
    Vector.Element element = null;
    int i = 0;
    while (vectorIterator.hasNext()) {
      if (i % 2 == 0) {
        element = vectorIterator.next();
      }
      System.out.printf("%d %d %f\n", i, element.index(), element.get());
      ++i;
    }


The output is:
0 0 1.000000
1 2 2.000000
2 2 2.000000
3 4 3.000000
4 4 3.000000
5 6 4.000000
6 6 4.000000

I expected it to be:
0 0 1.000000
1 0 1.000000
2 2 2.000000
3 2 2.000000
4 4 3.000000
5 4 3.000000
6 6 4.000000

So, I'm completely wrong. Is this just me not understanding what an
iterator is supposed to do?

[1] https://gist.github.com/dfilimon/5373271
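
The reported output is what one would expect from an iterator that computes its next element inside hasNext() and writes it into a single reused element object. The sketch below reproduces that pattern in isolation; ReusedElement, LookAheadReusingIterator and LookAheadDemo are hypothetical names written for illustration, not the Mahout or Guava implementation.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical stand-in for a reused Vector.Element; not Mahout's class.
final class ReusedElement {
  int index;
  double value;
}

// Look-ahead iterator: hasNext() eagerly loads the next entry, and that
// load overwrites the single shared ReusedElement.
final class LookAheadReusingIterator implements Iterator<ReusedElement> {
  private final int[] indices;
  private final double[] values;
  private final ReusedElement shared = new ReusedElement();
  private int position = 0;      // next slot to load
  private boolean ready = false; // true while 'shared' holds an unconsumed element

  LookAheadReusingIterator(int[] indices, double[] values) {
    this.indices = indices;
    this.values = values;
  }

  @Override
  public boolean hasNext() {
    if (ready) {
      return true;
    }
    if (position >= indices.length) {
      return false;
    }
    // Side effect: mutates the element the caller may still be holding.
    shared.index = indices[position];
    shared.value = values[position];
    position++;
    ready = true;
    return true;
  }

  @Override
  public ReusedElement next() {
    if (!hasNext()) {
      throw new NoSuchElementException();
    }
    ready = false;
    return shared;
  }
}

final class LookAheadDemo {
  // Replays the loop from the message and records "i index value" lines.
  static List<String> run() {
    Iterator<ReusedElement> it = new LookAheadReusingIterator(
        new int[] {0, 2, 4, 6}, new double[] {1, 2, 3, 4});
    List<String> lines = new ArrayList<>();
    ReusedElement element = null;
    int i = 0;
    while (it.hasNext()) {
      if (i % 2 == 0) {
        element = it.next();
      }
      lines.add(i + " " + element.index + " " + element.value);
      ++i;
    }
    return lines;
  }

  public static void main(String[] args) {
    run().forEach(System.out::println);
  }
}
```

On odd iterations no next() is called, yet the held element already shows the following index, because the hasNext() in the loop condition has loaded it.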

Re: Odd vector iteration behavior

Posted by Ted Dunning <te...@gmail.com>.
On Fri, Apr 12, 2013 at 11:49 AM, Ted Dunning <te...@gmail.com> wrote:

> (I suspect that the bright spark who did this in the first place was me so
> I can be rude)


I take it back.  The change to Guava iterators was part of the Sean-job
that ran on 4/10/2011 as part of MAHOUT-611.

Sean, this probably affects Myrrix code as well.

Re: Odd vector iteration behavior

Posted by Dan Filimon <da...@gmail.com>.
One more thing to point out – whatever fix we decide on using needs to
happen for *all* major vector iterator implementations –
SequentialAccessSparseVector, RandomAccessSparseVector, DenseVector.

Also, it looks like (at least) the following classes also have their custom
iterators: FileBasedSparseBinaryMatrix.SparseBinaryVector,
MatrixVectorView, PermutedVectorView, VectorView.

Funnily enough, FileBasedSparseBinaryMatrix.SparseBinaryVector and
MatrixVectorView do mention the need to clone the returned element.

Also, PermutedVectorView creates a new Element and VectorView creates a new
DecoratorElement before returning it.

These different classes do different things. Now that we're talking about
the right thing to do, it should apply to all Vector classes, right?



On Sat, Apr 13, 2013 at 1:07 AM, Jake Mannix <ja...@gmail.com> wrote:

> I think requiring the caller to know to copy/clone the element to be allowed
> to call hasNext() multiple times is extremely non-intuitive.  Having the
> caller know that it's dangerous / not allowed to hang onto an element without
> copying while continuing to iterate (e.g. when looking for the "largest"
> element) is fine, as this is used all over this project, as well as being the
> general contract with Writables in Hadoop.
>
> I think the most straightforward fix is Ted's b) solution.  Continue with
> re-use, but make sure that hasNext() is side-effect free by dropping
> AbstractIterator as our impl here.

Re: Odd vector iteration behavior

Posted by Jake Mannix <ja...@gmail.com>.
I think requiring the caller to know to copy/clone the element to be allowed
to call hasNext() multiple times is extremely non-intuitive.  Having the
caller know that it's dangerous / not allowed to hang onto an element without
copying while continuing to iterate (e.g. when looking for the "largest"
element) is fine, as this is used all over this project, as well as being the
general contract with Writables in Hadoop.

I think the most straightforward fix is Ted's b) solution.  Continue with
re-use, but make sure that hasNext() is side-effect free by dropping
AbstractIterator as our impl here.


On Fri, Apr 12, 2013 at 12:37 PM, Sean Owen <sr...@gmail.com> wrote:

> Yeah you have to have a clone() or copy constructor or something. Add
> that? that solves the other problem. At least, that's the intent and
> it is an option.
>
> If you need the value to be long-lived, one way or the other a new
> value is getting created. There's no way around it. There's only an
> optimization to be had when it need not be long lived. For example the
> iterators over sequence file key/values offer exactly this choice and
> the default is to *not* reuse.
>
> The reuse is a nice win in cases where you know it's going to be a benefit.
> If this particular iterator will never possibly take advantage of this
> optimization, there's no value in offering it, yes.
>
> You could offer the option, if it is used in cases where the
> optimization works, but that's extra complexity. I think I'd just keep
> an eye on performance impact and consider that option to have it both
> ways as a possible solution if needed.



-- 

  -jake
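
A minimal sketch of what option b) could look like: the element object is still reused, but hasNext() only inspects a position and all mutation happens in next(). SharedElement, SideEffectFreeIterator and SideEffectFreeDemo are hypothetical names, and the sketch assumes every stored entry is nonzero, so no skipping logic is needed.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical stand-in for a reused Vector.Element; not Mahout's class.
final class SharedElement {
  int index;
  double value;
}

// Reuses one element object, but hasNext() never touches it.
final class SideEffectFreeIterator implements Iterator<SharedElement> {
  private final int[] indices;
  private final double[] values;
  private final SharedElement shared = new SharedElement();
  private int nextPosition = 0;

  SideEffectFreeIterator(int[] indices, double[] values) {
    this.indices = indices;
    this.values = values;
  }

  @Override
  public boolean hasNext() {
    // Pure check: safe to call any number of times.
    return nextPosition < indices.length;
  }

  @Override
  public SharedElement next() {
    if (!hasNext()) {
      throw new NoSuchElementException();
    }
    // The shared element changes here and only here.
    shared.index = indices[nextPosition];
    shared.value = values[nextPosition];
    nextPosition++;
    return shared;
  }
}

final class SideEffectFreeDemo {
  // Same loop as in the first message: next() is called on even i only.
  static List<String> run() {
    Iterator<SharedElement> it = new SideEffectFreeIterator(
        new int[] {0, 2, 4, 6}, new double[] {1, 2, 3, 4});
    List<String> lines = new ArrayList<>();
    SharedElement element = null;
    int i = 0;
    while (it.hasNext()) {
      if (i % 2 == 0) {
        element = it.next();
      }
      lines.add(i + " " + element.index + " " + element.value);
      ++i;
    }
    return lines;
  }
}
```

With this shape the held element keeps its index and value until the next call to next(), which is the behavior the first message expected. Holding an element across a later next() still requires a copy.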

Re: Odd vector iteration behavior

Posted by Sean Owen <sr...@gmail.com>.
Yeah, you have to have a clone() or copy constructor or something. Add
that? That solves the other problem. At least, that's the intent and
it is an option.

If you need the value to be long-lived, one way or the other a new
value is getting created. There's no way around it. There's only an
optimization to be had when it need not be long lived. For example the
iterators over sequence file key/values offer exactly this choice and
the default is to *not* reuse.

The reuse is a nice win in cases where you know it's going to be a benefit.
If this particular iterator will never possibly take advantage of this
optimization, there's no value in offering it, yes.

You could offer the option, if it is used in cases where the
optimization works, but that's extra complexity. I think I'd just keep
an eye on performance impact and consider that option to have it both
ways as a possible solution if needed.
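
To make the copy-on-keep idea concrete, here is a sketch that assumes the element type gains a copy constructor. CopyableElement, ReusingIterator and LargestElementSketch are hypothetical names for illustration; the existing element classes are said in this thread to offer no way of cloning.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical element with a copy constructor; not Mahout's Vector.Element.
final class CopyableElement {
  int index;
  double value;

  CopyableElement() {
  }

  CopyableElement(CopyableElement other) {
    this.index = other.index;
    this.value = other.value;
  }
}

// Hypothetical iterator that hands out the same element object every time.
final class ReusingIterator implements Iterator<CopyableElement> {
  private final int[] indices;
  private final double[] values;
  private final CopyableElement shared = new CopyableElement();
  private int nextPosition = 0;

  ReusingIterator(int[] indices, double[] values) {
    this.indices = indices;
    this.values = values;
  }

  @Override
  public boolean hasNext() {
    return nextPosition < indices.length;
  }

  @Override
  public CopyableElement next() {
    if (!hasNext()) {
      throw new NoSuchElementException();
    }
    shared.index = indices[nextPosition];
    shared.value = values[nextPosition];
    nextPosition++;
    return shared;
  }
}

final class LargestElementSketch {
  // Returns a private copy of the element with the largest value, or null
  // if the iterator is empty. A copy is allocated only when a new maximum
  // is found, so short-lived elements stay allocation-free.
  static CopyableElement largest(Iterator<CopyableElement> it) {
    CopyableElement best = null;
    while (it.hasNext()) {
      CopyableElement current = it.next();
      if (best == null || current.value > best.value) {
        best = new CopyableElement(current);
      }
    }
    return best;
  }
}
```

Storing `current` directly instead of a copy would leave `best` pointing at whatever the iterator visited last.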

On Fri, Apr 12, 2013 at 8:31 PM, Dan Filimon
<da...@gmail.com> wrote:
> The problem with the 0th option is that the elements provide no way of
> cloning them.
>
> Also, reusing the value causes the problem in the first post, where the
> element advances even though it shouldn't.
> Calling next(), keeping the reference and expecting the value to not change
> when calling hasNext() seems reasonable to me.
>
> It seems non-intuitive that it should be cloned, because calling hasNext()
> could invalidate it. In my mind, the current element should only change
> when calling next().

Re: Odd vector iteration behavior

Posted by Dan Filimon <da...@gmail.com>.
The problem with the 0th option is that the elements provide no way of
cloning them.

Also, reusing the value causes the problem in the first post, where the
element advances even though it shouldn't.
Calling next(), keeping the reference and expecting the value to not change
when calling hasNext() seems reasonable to me.

It seems non-intuitive that it should be cloned, because calling hasNext()
could invalidate it. In my mind, the current element should only change
when calling next().



On Fri, Apr 12, 2013 at 10:22 PM, Sean Owen <sr...@gmail.com> wrote:

> I'm sure I did (at least much of) the AbstractIterator change so blame
> me... but I think the pattern itself is just fine. It's used in many
> places in the project. Reusing the value object is a big win in some
> places. Allocating objects is fast but a trillion of them still adds
> up.
>
> It does contain a requirement, and that is that the caller is supposed
> to copy/clone the value if it will be used at all after the next
> iterator operation. That's the 0th option, to just fix the caller
> here.
>
> On Fri, Apr 12, 2013 at 7:49 PM, Ted Dunning <te...@gmail.com>
> wrote:
> > The contract of computeNext is that there are no side effects visible
> > outside (i.e. apparent functional style).  This is required since
> > computeNext is called from hasNext().
> >
> > We are using a side-effecting style so we have a bug.
> >
> > We have two choices:
> >
> > a) use functional style. This will *require* that we allocate a new
> > container element on every call to computeNext.  This is best for the
> > user because they will have fewer surprising bugs due to reuse.  If
> > allocation is actually as bad as some people think (I remain skeptical
> > of that without tests) then this is a bad move.  If allocation of
> > totally ephemeral objects is as cheap as I think, then this would be a
> > good move.
> >
> > b) stop using AbstractIterator and continue with the re-use style.  And
> > add a comment to prevent a bright spark from reverting this change.  (I
> > suspect that the bright spark who did this in the first place was me so
> > I can be rude)
>

Re: Odd vector iteration behavior

Posted by Sean Owen <sr...@gmail.com>.
The JVM can only do this sort of thing if the object doesn't escape
scope -- totally local. This wouldn't be possible here.

On Mon, Apr 15, 2013 at 12:55 AM, Ted Dunning <te...@gmail.com> wrote:
> Did you mark the class and fields all as final?
>
> That might help the compiler realize it could in-line stuff and avoid the
> constructor (not likely, but possible)
>

Re: Odd vector iteration behavior

Posted by Robin Anil <ro...@gmail.com>.
Committed. I moved the getNumNonDefaultElements() change to a new method,
getNumNonZeroElements(), and used that for Pearson Correlation.
getNumNonDefaultElements() will work as before; it really needs to be a
private method.
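
A rough sketch of the distinction between the two counts for a dense vector, assuming (as reported in the quoted discussion) that the dense non-default count is simply the total length. NonZeroCountSketch and its method names are hypothetical and operate on a plain array rather than on Mahout's classes.

```java
final class NonZeroCountSketch {
  // Assumed old dense behavior: every slot counts, zero or not.
  static int numNonDefault(double[] denseValues) {
    return denseValues.length;
  }

  // Count only the entries that are actually nonzero.
  static int numNonZero(double[] denseValues) {
    int count = 0;
    for (double v : denseValues) {
      if (v != 0.0) {
        count++;
      }
    }
    return count;
  }
}
```

For the 13 values quoted in this thread, { 0, 2, 0, 0, 8, 3, 0, 6, 0, 1, 1, 2, 1 }, the first method gives 13 and the second gives 8, which is the kind of gap that would make a similarity measure differ between dense and sparse inputs.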

http://svn.apache.org/viewvc?view=revision&revision=1468209

Sync, let me know if you find any issues with this.

Robin


On Mon, Apr 15, 2013 at 2:30 PM, Robin Anil <ro...@gmail.com> wrote:

> Iterable is a safer interface, you can implement non-zero-ness check
> easily. Iterator is not.
>
> I think I have fixed all the failing tests (They were failing because the
> asFormatString order seems to have changed with the new iterators)
>
> https://reviews.apache.org/r/10455/diff/6/
>
>
>
> On Mon, Apr 15, 2013 at 2:21 PM, Jake Mannix <ja...@gmail.com>wrote:
>
>> On Mon, Apr 15, 2013 at 12:14 PM, Robin Anil <ro...@gmail.com>
>> wrote:
>>
>> > Another crazy idea for the future is to kill the usage of
>> > OpenIntDoubleHashMap entirely and copy parts of it inside RASV which
>> > will only deal with nonzero keys and non zero values. RASV can then keep
>> > track of non-zero elements in a variable to speed up those lookups.
>> >
>> >
>> > Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
>> >
>> >
>> > On Mon, Apr 15, 2013 at 2:11 PM, Robin Anil <ro...@gmail.com>
>> wrote:
>> >
>> > > The point 3 is coming from the philosophy that all Vectors behave the
>> > > same way and numNonDefaultElements of a DenseVector is the same as that
>> > > of a SparseVector. Eg, if PersonSimilarity relies upon it for document
>> > > length, it should behave the same way.
>> > >
>> > > The point 4 can be solved by killing the iterator interface entirely
>> > > and creating a forEachNonZero(function()) method which will only call
>> > > the function if the element is nonzero.
>> >
>>
>> Killing iteration would be really really bad, from a usability
>> standpoint. In fact, I've been moving in the other direction:
>> https://reviews.apache.org/r/9867/
>> adds iterators to the basic collection interface!
>>
>>
>>
>> >  >
>> > >
>> > >
>> > > On Mon, Apr 15, 2013 at 2:08 PM, Jake Mannix <jake.mannix@gmail.com
>> > >wrote:
>> > >
>> > >> On Mon, Apr 15, 2013 at 11:58 AM, Robin Anil <ro...@gmail.com>
>> > >> wrote:
>> > >>
>> > >> > This is what I propose:
>> > >> >
>> > >> > 1) Allow setting value to zero while iterating (e.set(0.0)).
>> > >> >
>> > >>
>> > >> This is in addition to the fact that we already allow setting nonzero
>> > >> values while iterating, right?
>> > >>
>> > >>
>> > >> > 2) Do not allow callers to use vector.set(index, 0.0) while
>> > >> > iterating. This can cause re-hashing. (Can set a dirty bit in the
>> > >> > hashmap during rehash to throw a concurrent modification exception)
>> > >> >
>> > >>
>> > >> Agreed - this is a commonly accepted requirement: I think in fact we
>> > >> should pro-actively throw ConcurrentModificationException if someone
>> > >> tries to call vector.set / vector.assign while iterating.
>> > >>
>> > >>
>> > >> > 3) Update the numNonDefaultElements to iterate over the array to
>> > >> > discount 0.0 instead of returning the hashMap values.
>> > >> > 4) IterateNonZero may iterate over a few zeros if you did set the
>> > >> > dimension to 0. Most of the statistics code should handle 0 values
>> > >> > correctly.
>> > >> >
>> > >>
>> > >> Yeah, are we really strict about getNumNonDefaultElements really always
>> > >> returning exactly the number of nonzeroes?  I was under the impression
>> > >> that for e.g. DenseVector, it would give the overall size, even if some
>> > >> were 0, and that it was basically tracking the amount of space the
>> > >> vector was taking up.  But I can see the argument that it really should
>> > >> return what it says it returns, if that is relied upon.
>> > >>
>> > >>
>> > >> >
>> > >> >
>> > >> >
>> > >> >
>> > >> >
>> > >> > On Mon, Apr 15, 2013 at 1:50 PM, Jake Mannix <
>> jake.mannix@gmail.com>
>> > >> > wrote:
>> > >> >
>> > >> > > Ah, this was the one corner case I was worried about - we do
>> > >> > > special-case setting to 0, as meaning remove from the hashmap, yes.
>> > >> > >
>> > >> > > What's the TL;DR of what you did to work around this?  Should we
>> > >> > > allow this?  Even if it's through the Vector.Element instance, should
>> > >> > > it be ok?  If so, how to handle?
>> > >> > >
>> > >> > >
>> > >> > > On Mon, Apr 15, 2013 at 11:04 AM, Robin Anil <
>> robin.anil@gmail.com>
>> > >> > wrote:
>> > >> > >
>> > >> > > > I am adding the tests and updating the patch.
>> > >> > > >
>> > >> > > >
>> > >> > > >
>> > >> > > > On Mon, Apr 15, 2013 at 1:03 PM, Robin Anil <
>> robin.anil@gmail.com
>> > >
>> > >> > > wrote:
>> > >> > > >
>> > >> > > > > You can re-iterate if the state is in iteration. But you cannot
>> > >> > > > > write.
>> > >> > > > >
>> > >> > > > > This is what is happening:
>> > >> > > > >
>> > >> > > > > One of the values is becoming 0, so Vector tries to remove it from
>> > >> > > > > the underlying hashmap. This changes the layout. If a vector has to
>> > >> > > > > be mutated while iterating, we have to set a 0 value in the hashmap
>> > >> > > > > rather than remove it, which is what the Vector layer is doing. This
>> > >> > > > > adds another complexity: the vector iterator has to deal with
>> > >> > > > > skipping over elements with a 0 value.
>> > >> > > > >
>> > >> > > > >
>> > >> > > > > Try this
>> > >> > > > >
>> > >> > > > > Create a vector of length 13 and set the following values.
>> > >> > > > >
>> > >> > > > >
>> > >> > > > >     double[] val = new double[] { 0, 2, 0, 0, 8, 3, 0, 6, 0, 1, 1, 2, 1 };
>> > >> > > > >     for (int i = 0; i < val.length; ++i) {
>> > >> > > > >       vector.set(i, val[i]);
>> > >> > > > >     }
>> > >> > > > >
>> > >> > > > > Iterate again and while iterating set one of the values as
>> zero.
>> > >> > > > >
>> > >> > > > > On Mon, Apr 15, 2013 at 12:56 PM, Dan Filimon <
>> > >> > > > dangeorge.filimon@gmail.com
>> > >> > > > > > wrote:
>> > >> > > > >
>> > >> > > > >> What kind of Vector is failing to set() in that code?
>> > >> > > > >>
>> > >> > > > >> About the state enum, what if (for whatever reason, not
>> > >> > > > >> multi-threaded-ness) there are multiple iterators to that vector?
>> > >> > > > >> Something like a reference count (how many iterators point to it)
>> > >> > > > >> would probably be needed, and keeping it sane would only be possible
>> > >> > > > >> in one thread. Although this seems kind of brittle.
>> > >> > > > >>
>> > >> > > > >> +1 for numNonDefault.
>> > >> > > > >>
>> > >> > > > >>
>> > >> > > > >> On Mon, Apr 15, 2013 at 8:36 PM, Robin Anil <
>> > >> robin.anil@gmail.com>
>> > >> > > > wrote:
>> > >> > > > >>
>> > >> > > > >>> Another behavior difference.
>> > >> > > > >>>
>> > >> > > > >>> The numNonDefaultElement for a DenseVector returns the total length.
>> > >> > > > >>> This causes Pearson Correlation Similarity to differ from what it
>> > >> > > > >>> would be if it was implemented using one of the SparseVectors.
>> > >> > > > >>> I am proposing to fix the numNonDefaultElement to correctly iterate
>> > >> > > > >>> over the dense vector to figure out the nonzero values. Sounds OK?
>> > >> > > > >>>
>> > >> > > > >>>
>> > >> > > > >>>
>> > >> > > > >>> On Mon, Apr 15, 2013 at 12:32 PM, Robin Anil <
>> > >> robin.anil@gmail.com
>> > >> > > > >wrote:
>> > >> > > > >>>
>> > >> > > > >>>> Found the bug: PearsonCorrelationSimilarity was trying to mutate
>> > >> > > > >>>> the object while iterating.
>> > >> > > > >>>>
>> > >> > > > >>>>     while (it.hasNext()) {
>> > >> > > > >>>>       Vector.Element e = it.next();
>> > >> > > > >>>>       vector.set(e.index(), e.get() - average);
>> > >> > > > >>>>     }
>> > >> > > > >>>>
>> > >> > > > >>>> This has a side effect of causing the underlying hash-map or
>> > >> > > > >>>> object to change.
>> > >> > > > >>>>
>> > >> > > > >>>> The right behavior is to set the value through the element while
>> > >> > > > >>>> iterating.
>> > >> > > > >>>>
>> > >> > > > >>>>     while (it.hasNext()) {
>> > >> > > > >>>>       Vector.Element e = it.next();
>> > >> > > > >>>>       e.set(e.get() - average);
>> > >> > > > >>>>     }
>> > >> > > > >>>>
>> > >> > > > >>>> I am sure we are incorrectly doing the first style across the code
>> > >> > > > >>>> at many places.
>> > >> > > > >>>>
>> > >> > > > >>>> I am proposing this:
>> > >> > > > >>>>
>> > >> > > > >>>> When iterating, we lock the set interface on the vector using a
>> > >> > > > >>>> State enum. If anyone tries to mutate, we throw an exception.
>> > >> > > > >>>> We flip the state when we complete iterating (hasNext = false) or
>> > >> > > > >>>> when we explicitly close the iterator (adding a close method on
>> > >> > > > >>>> the iterator).
>> > >> > > > >>>>
>> > >> > > > >>>> Again this is all a single-thread fix. If a vector is being mutated
>> > >> > > > >>>> and iterated across multiple threads, all hell can break loose.
>> > >> > > > >>>>
>> > >> > > > >>>> Robin
>> > >> > > > >>>>
>> > >> > > > >>>>
>> > >> > > > >>>>
>> > >> > > > >>>> On Mon, Apr 15, 2013 at 12:56 AM, Robin Anil <
>> > >> > robin.anil@gmail.com
>> > >> > > > >wrote:
>> > >> > > > >>>>
>> > >> > > > >>>>> Spoke too soon still failure.  I am uploading the latest patch.
>> > >> > > > >>>>> These are the current failing tests.
>> > >> > > > >>>>>
>> > >> > > > >>>>> ClusterClassificationDriverTest.testVectorClassificationWithOutlierRemovalMR:103->assertVectorsWithOutlierRemoval:189->checkClustersWithOutlierRemoval:239->Assert.assertTrue:41->Assert.fail:88
>> > >> > > > >>>>> not expecting cluster:{0:1.0,1:1.0}
>> > >> > > > >>>>>
>> > >> > > > >>>>> ClusterClassificationDriverTest.testVectorClassificationWithOutlierRemoval:139->assertVectorsWithOutlierRemoval:189->checkClustersWithOutlierRemoval:239->Assert.assertTrue:41->Assert.fail:88
>> > >> > > > >>>>> not expecting cluster:{0:1.0,1:1.0}
>> > >> > > > >>>>>
>> > >> > > > >>>>> ClusterClassificationDriverTest.testVectorClassificationWithoutOutlierRemoval:121->assertVectorsWithoutOutlierRemoval:193->assertFirstClusterWithoutOutlierRemoval:218->Assert.assertTrue:52->Assert.assertTrue:41->Assert.fail:86
>> > >> > > > >>>>> null
>> > >> > > > >>>>>
>> > >> > > > >>>>> ClusterOutputPostProcessorTest.testTopDownClustering:102->assertPostProcessedOutput:188->assertTopLevelCluster:115->assertPointsInSecondTopLevelCluster:134->Assert.assertTrue:52->Assert.assertTrue:41->Assert.fail:86
>> > >> > > > >>>>> null
>> > >> > > > >>>>>
>> > >> > > > >>>>> VectorSimilarityMeasuresTest.testPearsonCorrelationSimilarity:109->Assert.assertEquals:592->Assert.assertEquals:494->Assert.failNotEquals:743->Assert.fail:88
>> > >> > > > >>>>> expected:<0.5303300858899108> but was:<0.38729833462074176>
>> > >> > > > >>>>>
>> > >> > > > >>>>>
>> > >> > > > >>>>>
>> > >> > > > >>>>>
>> > >> > > > >>>>> On Mon, Apr 15, 2013 at 12:24 AM, Robin Anil <
>> > >> > robin.anil@gmail.com
>> > >> > > > >wrote:
>> > >> > > > >>>>>
>> > >> > > > >>>>>> Found it, fixed it. I am submitting soon.
>> > >> > > > >>>>>>
>> > >> > > > >>>>>>
>> > >> > > > >>>>>>
>> > >> > > > >>>>>> On Sun, Apr 14, 2013 at 11:56 PM, Ted Dunning <
>> > >> > > > ted.dunning@gmail.com>wrote:
>> > >> > > > >>>>>>
>> > >> > > > >>>>>>> Robin,
>> > >> > > > >>>>>>>
>> > >> > > > >>>>>>> Can you make sure that the patches are somewhere that Dan can
>> > >> > > > >>>>>>> pick up this work?  He is in GMT+2 and is probably about to
>> > >> > > > >>>>>>> appear on the scene.
>> > >> > > > >>>>>>>
>> > >> > > > >>>>>>>
>> > >> > > > >>>>>>>
>> > >> > > > >>>>>>> On Sun, Apr 14, 2013 at 9:34 PM, Robin Anil <
>> > >> > > robin.anil@gmail.com>
>> > >> > > > >>>>>>> wrote:
>> > >> > > > >>>>>>>
>> > >> > > > >>>>>>> > Strike that, there are still failures. Investigating. If I
>> > >> > > > >>>>>>> > can't fix it in the next hour, I will submit them sometime in
>> > >> > > > >>>>>>> > the evening tomorrow.
>> > >> > > > >>>>>>> >
>> > >> > > > >>>>>>> >
>> > >> > > > >>>>>>> >
>> > >> > > > >>>>>>> > On Sun, Apr 14, 2013 at 11:33 PM, Robin Anil <
>> > >> > > > robin.anil@gmail.com>
>> > >> > > > >>>>>>> wrote:
>> > >> > > > >>>>>>> >
>> > >> > > > >>>>>>> > > Tests pass. Submitting the patches.
>> > >> > > > >>>>>>> > >
>> > >> > > > >>>>>>> > >
>> > >> > > > >>>>>>> > >
>> > >> > > > >>>>>>> > > On Sun, Apr 14, 2013 at 11:17 PM, Robin Anil <
>> > >> > > > >>>>>>> robin.anil@gmail.com>
>> > >> > > > >>>>>>> > wrote:
>> > >> > > > >>>>>>> > >
>> > >> > > > >>>>>>> > >> Added a few more tests. Throw NoSuchElementException like Java
>> > >> > > > >>>>>>> > >> Collections when iterating past the end. Things look solid,
>> > >> > > > >>>>>>> > >> performance is 2x. All Math tests pass. I am now waiting for
>> > >> > > > >>>>>>> > >> the entire test suites to run before submitting.
>> > >> > > > >>>>>>> > >>
>> > >> > > > >>>>>>> > >> Robin Anil | Software Engineer | +1 312 869 2602|
>> > >> Google
>> > >> > > > Inc.
>> > >> > > > >>>>>>> > >>
>> > >> > > > >>>>>>> > >>
>> > >> > > > >>>>>>> > >> On Sun, Apr 14, 2013 at 9:49 PM, Robin Anil <
>> > >> > > > >>>>>>> robin.anil@gmail.com>
>> > >> > > > >>>>>>> > wrote:
>> > >> > > > >>>>>>> > >>
>> > >> > > > >>>>>>> > >>> I am not sure what I did. But removing Guava
>> > Abstract
>> > >> > > > iterator
>> > >> > > > >>>>>>> actually
>> > >> > > > >>>>>>> > >>> sped up the dot, cosine, euclidean by another
>> 60%.
>> > >> Things
>> > >> > > are
>> > >> > > > >>>>>>> now 2x
>> > >> > > > >>>>>>> > faster
>> > >> > > > >>>>>>> > >>> than trunk. While also correcting the behavior (I
>> > >> hope)
>> > >> > > > >>>>>>> > >>>
>> > >> > > > >>>>>>> > >>>
>> > >> > > > >>>>>>> > >>>
>> > >> > > > >>>>>>> >
>> > >> > > > >>>>>>>
>> > >> > > >
>> > >> > >
>> > >> >
>> > >>
>> >
>> https://docs.google.com/spreadsheet/ccc?key=0AhewTD_ZgznddGFQbWJCQTZXSnFULUYzdURfWDRJQlE#gid=1
>> > >> > > > >>>>>>> > >>>
>> > >> > > > >>>>>>> > >>> Robin Anil | Software Engineer | +1 312 869 2602|
>> > >> > Google
>> > >> > > > Inc.
>> > >> > > > >>>>>>> > >>>
>> > >> > > > >>>>>>> > >>>
>> > >> > > > >>>>>>> > >>> On Sun, Apr 14, 2013 at 8:56 PM, Robin Anil <
>> > >> > > > >>>>>>> robin.anil@gmail.com
>> > >> > > > >>>>>>> > >wrote:
>> > >> > > > >>>>>>> > >>>
>> > >> > > > >>>>>>> > >>>> Also note that this is code gen, I have to
>> create
>> > >> > > > >>>>>>> > Element$keyType$Value
>> > >> > > > >>>>>>> > >>>> for each and every combination not just int
>> double.
>> > >> and
>> > >> > > also
>> > >> > > > >>>>>>> update
>> > >> > > > >>>>>>> > all
>> > >> > > > >>>>>>> > >>>> callers to user ElementIntDouble instead of
>> > Element.
>> > >> Is
>> > >> > it
>> > >> > > > >>>>>>> worth it ?
>> > >> > > > >>>>>>> > >>>>
>> > >> > > > >>>>>>> > >>>> Robin Anil | Software Engineer | +1 312 869
>> 2602 |
>> > >> > Google
>> > >> > > > >>>>>>> Inc.
>> > >> > > > >>>>>>> > >>>>
>> > >> > > > >>>>>>> > >>>>
>> > >> > > > >>>>>>> > >>>> On Sun, Apr 14, 2013 at 8:46 PM, Ted Dunning <
>> > >> > > > >>>>>>> ted.dunning@gmail.com
>> > >> > > > >>>>>>> > >wrote:
>> > >> > > > >>>>>>> > >>>>
>> > >> > > > >>>>>>> > >>>>> Collections (no longer colt collections) are
>> now
>> > >> part
>> > >> > of
>> > >> > > > >>>>>>> mahout math.
>> > >> > > > >>>>>>> > >>>>>  No
>> > >> > > > >>>>>>> > >>>>> need to keep them separate.  The lower iterator
>> > can
>> > >> > > > reference
>> > >> > > > >>>>>>> > >>>>> Vector.Element
>> > >> > > > >>>>>>> > >>>>>
>> > >> > > > >>>>>>> > >>>>>
>> > >> > > > >>>>>>> > >>>>> On Sun, Apr 14, 2013 at 6:24 PM, Robin Anil <
>> > >> > > > >>>>>>> robin.anil@gmail.com>
>> > >> > > > >>>>>>> > >>>>> wrote:
>> > >> > > > >>>>>>> > >>>>>
>> > >> > > > >>>>>>> > >>>>> > I would have loved to but Element is a sub
>> > >> interface
>> > >> > in
>> > >> > > > >>>>>>> Vector. If
>> > >> > > > >>>>>>> > >>>>> we want
>> > >> > > > >>>>>>> > >>>>> > to keep colt collections separate we have to
>> > keep
>> > >> > this
>> > >> > > > >>>>>>> separation.
>> > >> > > > >>>>>>> > >>>>> >
>> > >> > > > >>>>>>> > >>>>>
>> > >> > > > >>>>>>> > >>>>
>> > >> > > > >>>>>>> > >>>>
>> > >> > > > >>>>>>> > >>>
>> > >> > > > >>>>>>> > >>
>> > >> > > > >>>>>>> > >
>> > >> > > > >>>>>>> >
>> > >> > > > >>>>>>>
>> > >> > > > >>>>>>
>> > >> > > > >>>>>>
>> > >> > > > >>>>>
>> > >> > > > >>>>
>> > >> > > > >>>
>> > >> > > > >>
>> > >> > > > >
>> > >> > > >
>> > >> > >
>> > >> > >
>> > >> > >
>> > >> > > --
>> > >> > >
>> > >> > >   -jake
>> > >> > >
>> > >> >
>> > >>
>> > >>
>> > >>
>> > >> --
>> > >>
>> > >>   -jake
>> > >>
>> > >
>> > >
>> >
>>
>>
>>
>> --
>>
>>   -jake
>>
>
>

Re: Odd vector iteration behavior

Posted by Robin Anil <ro...@gmail.com>.
Iterable is the safer interface: a non-zero check is easy to implement
on top of it. Iterator is not.
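
A minimal sketch of why the non-zero check is easy behind an Iterable,
assuming a plain double[] as the backing store. The class and method names
below are hypothetical and are not Mahout's API. Every call to iterator()
hands out a fresh cursor, and hasNext() only inspects state instead of
advancing it:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical illustration only; not part of Mahout.
public class NonZeroIterableSketch {

  /** Iterable over the indices of the non-zero entries of values. */
  public static Iterable<Integer> nonZeroIndices(final double[] values) {
    return new Iterable<Integer>() {
      @Override
      public Iterator<Integer> iterator() {
        // A fresh cursor per call, so two loops never share state.
        return new Iterator<Integer>() {
          private int cursor = skipZeros(0);

          // First index >= from that holds a non-zero value, or values.length.
          private int skipZeros(int from) {
            int i = from;
            while (i < values.length && values[i] == 0.0) {
              i++;
            }
            return i;
          }

          @Override
          public boolean hasNext() {
            return cursor < values.length; // read-only: never advances
          }

          @Override
          public Integer next() {
            if (!hasNext()) {
              throw new NoSuchElementException();
            }
            int current = cursor;
            cursor = skipZeros(current + 1);
            return current;
          }

          @Override
          public void remove() {
            throw new UnsupportedOperationException();
          }
        };
      }
    };
  }

  /** Collects the non-zero indices into a list (helper for the demo). */
  public static List<Integer> collect(double[] values) {
    List<Integer> out = new ArrayList<Integer>();
    for (int index : nonZeroIndices(values)) {
      out.add(index);
    }
    return out;
  }

  public static void main(String[] args) {
    double[] values = {0, 2, 0, 0, 8, 3, 0, 6};
    System.out.println(collect(values)); // [1, 4, 5, 7]
  }
}
```

Because callers only ever see the Iterable, they cannot hold on to a
half-consumed cursor between loops, which is the failure mode a shared
Iterator invites.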

I think I have fixed all the failing tests. (They were failing because the
asFormatString order seems to have changed with the new iterators.)

https://reviews.apache.org/r/10455/diff/6/

Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.


On Mon, Apr 15, 2013 at 2:21 PM, Jake Mannix <ja...@gmail.com> wrote:

> On Mon, Apr 15, 2013 at 12:14 PM, Robin Anil <ro...@gmail.com> wrote:
>
> > Another crazy idea for the future is to kill the usage of
> > OpenIntDoubleHashMap entirely and copy parts of it inside RASV which will
> > only deal with nonzero keys and non zero values. RASV can then keep track
> > of non-zero elements in a variable to speed up those lookups.
> >
> >
> > Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
> >
> >
> > On Mon, Apr 15, 2013 at 2:11 PM, Robin Anil <ro...@gmail.com>
> wrote:
> >
> > > Point 3 comes from the philosophy that all Vectors behave the same
> > > way, so numNonDefaultElements of a DenseVector is the same as that of a
> > > SparseVector. E.g., if PersonSimilarity relies upon it for document
> > > length, it should behave the same way.
> > >
> > > Point 4 can be solved by killing the iterator interface entirely and
> > > creating a forEachNonZero(function()) method which will only call the
> > > function if the element is nonzero.
> >
>
> Killing iteration would be really, really bad from a usability standpoint.
> In fact, I've been moving in the other direction:
> https://reviews.apache.org/r/9867/
> adds iterators to the basic collection interface!
>
>
>
> >  >
> > >
> > >
> > > On Mon, Apr 15, 2013 at 2:08 PM, Jake Mannix <jake.mannix@gmail.com
> > >wrote:
> > >
> > >> On Mon, Apr 15, 2013 at 11:58 AM, Robin Anil <ro...@gmail.com>
> > >> wrote:
> > >>
> > >> > This is what I propose:
> > >> >
> > >> > 1) Allow setting value to zero while iterating (e.set(0.0)).
> > >> >
> > >>
> > >> This is in addition to the fact that we already allow setting nonzero
> > >> values
> > >> while iterating, right?
> > >>
> > >>
> > >> > 2) Do not allow callers to use vector.set(index, 0.0) while iterating.
> > >> > This can cause re-hashing. (We can set a dirty bit in the hashmap during
> > >> > rehash to throw a ConcurrentModificationException.)
> > >> >
> > >>
> > >> Agreed - this is a commonly accepted requirement: I think in fact we
> > >> should pro-actively throw ConcurrentModificationException if someone
> > >> tries to call vector.set / vector.assign while iterating.
> > >>
> > >>
> > >> > 3) Update the numNonDefaultElements to iterate over the array to
> > >> discount
> > >> > 0.0 instead of returning the hashMap values.
> > >> > 4) IterateNonZero may iterate over a few zeros if you did set the
> > >> dimension
> > >> > to 0. Most of the statistics code should handle 0 values correctly.
> > >> >
> > >>
> > >> Yeah, are we really strict about getNumNonDefaultElements always
> > >> returning exactly the number of nonzeroes?  I was under the impression that
> > >> for e.g. DenseVector, it would give the overall size, even if some were 0,
> > >> and that it was basically tracking the amount of space the vector was taking
> > >> up.  But I can see the argument that it really should return what it says it
> > >> returns, if that is relied upon.
> > >>
> > >>
> > >> >
> > >> >
> > >> >
> > >> > Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
> > >> >
> > >> >
> > >> > On Mon, Apr 15, 2013 at 1:50 PM, Jake Mannix <jake.mannix@gmail.com
> >
> > >> > wrote:
> > >> >
> > >> > > Ah, this was the one corner case I was worried about - we do
> > >> special-case
> > >> > > setting to 0,
> > >> > > as meaning remove from the hashmap, yes.
> > >> > >
> > >> > > What's the TL;DR of what you did to work around this?  Should we
> > allow
> > >> > > this?  Even
> > >> > > if it's through the Vector.Element instance, should it be ok?  If
> > so,
> > >> how
> > >> > > to handle?
> > >> > >
> > >> > >
> > >> > > On Mon, Apr 15, 2013 at 11:04 AM, Robin Anil <
> robin.anil@gmail.com>
> > >> > wrote:
> > >> > >
> > >> > > > I am adding the tests and updating the patch.
> > >> > > >
> > >> > > > Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
> > >> > > >
> > >> > > >
> > >> > > > On Mon, Apr 15, 2013 at 1:03 PM, Robin Anil <
> robin.anil@gmail.com
> > >
> > >> > > wrote:
> > >> > > >
> > >> > > > > You can re-iterate if the state is in iteration. But you
> cannot
> > >> > write.
> > >> > > > >
> > >> > > > > This is what is happening:
> > >> > > > >
> > >> > > > > One of the values is becoming 0, so Vector tries to remove it from the
> > >> > > > > underlying hashmap. This changes the layout. If a vector has to be mutated
> > >> > > > > while iterating, we have to set a 0 value in the hashmap and not remove it
> > >> > > > > as the Vector layer does now. This adds another complexity: the
> > >> > > > > vector iterator has to deal with skipping over elements with a 0 value.
> > >> > > > >
> > >> > > > >
> > >> > > > > Try this
> > >> > > > >
> > >> > > > > Create a vector of length 13 and set the following values.
> > >> > > > >
> > >> > > > >
> > >> > > > >     double[] val = new double[] { 0, 2, 0, 0, 8, 3, 0, 6, 0, 1, 1, 2, 1 };
> > >> > > > >     for (int i = 0; i < val.length; ++i) {
> > >> > > > >       vector.set(i, val[i]);
> > >> > > > >     }
> > >> > > > >
> > >> > > > > Iterate again and while iterating set one of the values as
> zero.
> > >> > > > >
> > >> > > > > On Mon, Apr 15, 2013 at 12:56 PM, Dan Filimon <
> > >> > > > dangeorge.filimon@gmail.com
> > >> > > > > > wrote:
> > >> > > > >
> > >> > > > >> What kind of Vector is failing to set() in that code?
> > >> > > > >>
> > >> > > > >> About the state enum, what if (for whatever reason, not
> > >> > > > >> multi-threaded-ness) there are multiple iterators to that
> > vector?
> > >> > > > >> Something like a reference count (how many iterators point to
> > it)
> > >> > > would
> > >> > > > >> probably be needed, and keeping it sane would only be
> possible
> > in
> > >> > one
> > >> > > > >> thread. Although this seems kind of brittle.
> > >> > > > >>
> > >> > > > >> +1 for numNonDefault.
> > >> > > > >>
> > >> > > > >>
> > >> > > > >> On Mon, Apr 15, 2013 at 8:36 PM, Robin Anil <
> > >> robin.anil@gmail.com>
> > >> > > > wrote:
> > >> > > > >>
> > >> > > > >>> Another behavior difference.
> > >> > > > >>>
> > >> > > > >>> The numNonDefaultElement for a DenseVector returns the total length.
> > >> > > > >>> This causes Pearson Correlation Similarity to differ from what it would
> > >> > > > >>> be if it were implemented using one of the SparseVectors.
> > >> > > > >>> I am proposing to fix numNonDefaultElement to correctly iterate over
> > >> > > > >>> the dense vector to count the non-zero values. Sounds ok?
> > >> > > > >>>
> > >> > > > >>> Robin Anil | Software Engineer | +1 312 869 2602 | Google
> > Inc.
> > >> > > > >>>
> > >> > > > >>>
> > >> > > > >>> On Mon, Apr 15, 2013 at 12:32 PM, Robin Anil <
> > >> robin.anil@gmail.com
> > >> > > > >wrote:
> > >> > > > >>>
> > >> > > > >>>> Found the bug PearsonCorrelationSimilarity was trying to
> > mutate
> > >> > the
> > >> > > > >>>> object while iterating.
> > >> > > > >>>>
> > >> > > > >>>>
> > >> > > > >>>>     while (it.hasNext()) {
> > >> > > > >>>>       Vector.Element e = it.next();
> > >> > > > >>>>       vector.set(e.index(), e.get() - average);
> > >> > > > >>>>     }
> > >> > > > >>>>
> > >> > > > >>>> This has a side effect of causing the underlying hash-map
> or
> > >> > object
> > >> > > to
> > >> > > > >>>> change.
> > >> > > > >>>>
> > >> > > > >>>> The right behavior is to set the value of the index while
> > >> > iterating.
> > >> > > > >>>>
> > >> > > > >>>>     while (it.hasNext()) {
> > >> > > > >>>>       Vector.Element e = it.next();
> > >> > > > >>>>       e.set(e.get() - average);
> > >> > > > >>>>     }
> > >> > > > >>>>
> > >> > > > >>>> I am sure we are incorrectly doing the first style across
> the
> > >> code
> > >> > > at
> > >> > > > >>>> many places.
> > >> > > > >>>>
> > >> > > > >>>> I am proposing this
> > >> > > > >>>>
> > >> > > > >>>> When iterating, we lock the set interface on the vector
> > using a
> > >> > > State
> > >> > > > >>>> enum. If anyone tries to mutate, we throw an exception.
> > >> > > > >>>> We flip the state when we complete iterating (hasNext =
> > false)
> > >> or
> > >> > > when
> > >> > > > >>>> we explicitly close the iterator (adding a close method on
> > the
> > >> > > > iterator).
> > >> > > > >>>>
> > >> > > > >>>> Again this is all a single thread fix. if a vector is being
> > >> > mutated
> > >> > > > and
> > >> > > > >>>> iterated across multiple threads, all hell can break loose.
> > >> > > > >>>>
> > >> > > > >>>> Robin
> > >> > > > >>>>
> > >> > > > >>>>
> > >> > > > >>>>
> > >> > > > >>>> On Mon, Apr 15, 2013 at 12:56 AM, Robin Anil <
> > >> > robin.anil@gmail.com
> > >> > > > >wrote:
> > >> > > > >>>>
> > >> > > > >>>>> Spoke too soon still failure.  I am uploading the latest
> > >> patch.
> > >> > > These
> > >> > > > >>>>> are the current failing tests.
> > >> > > > >>>>>
> > >> > > > >>>>>
> > >> > > >
> > >> > >
> > >> >
> > >>
> >
>  ClusterClassificationDriverTest.testVectorClassificationWithOutlierRemovalMR:103->assertVectorsWithOutlierRemoval:189->checkClustersWithOutlierRemoval:239->Assert.assertTrue:41->Assert.fail:88
> > >> > > > >>>>> not expecting cluster:{0:1.0,1:1.0}
> > >> > > > >>>>>
> > >> > > > >>>>>
> > >> > > >
> > >> > >
> > >> >
> > >>
> >
> ClusterClassificationDriverTest.testVectorClassificationWithOutlierRemoval:139->assertVectorsWithOutlierRemoval:189->checkClustersWithOutlierRemoval:239->Assert.assertTrue:41->Assert.fail:88
> > >> > > > >>>>> not expecting cluster:{0:1.0,1:1.0}
> > >> > > > >>>>>
> > >> > > > >>>>>
> > >> > > >
> > >> > >
> > >> >
> > >>
> >
> ClusterClassificationDriverTest.testVectorClassificationWithoutOutlierRemoval:121->assertVectorsWithoutOutlierRemoval:193->assertFirstClusterWithoutOutlierRemoval:218->Assert.assertTrue:52->Assert.assertTrue:41->Assert.fail:86
> > >> > > > >>>>> null
> > >> > > > >>>>>
> > >> > > > >>>>>
> > >> > > >
> > >> > >
> > >> >
> > >>
> >
> ClusterOutputPostProcessorTest.testTopDownClustering:102->assertPostProcessedOutput:188->assertTopLevelCluster:115->assertPointsInSecondTopLevelCluster:134->Assert.assertTrue:52->Assert.assertTrue:41->Assert.fail:86
> > >> > > > >>>>> null
> > >> > > > >>>>>
> > >> > > > >>>>>
> > >> > > >
> > >> > >
> > >> >
> > >>
> >
> VectorSimilarityMeasuresTest.testPearsonCorrelationSimilarity:109->Assert.assertEquals:592->Assert.assertEquals:494->Assert.failNotEquals:743->Assert.fail:88
> > >> > > > >>>>> expected:<0.5303300858899108> but
> was:<0.38729833462074176>
> > >> > > > >>>>>
> > >> > > > >>>>>
> > >> > > > >>>>> Robin Anil | Software Engineer | +1 312 869 2602 | Google
> > >> Inc.
> > >> > > > >>>>>
> > >> > > > >>>>>
> > >> > > > >>>>> On Mon, Apr 15, 2013 at 12:24 AM, Robin Anil <
> > >> > robin.anil@gmail.com
> > >> > > > >wrote:
> > >> > > > >>>>>
> > >> > > > >>>>>> Found it, fixed it. I am submitting soon.
> > >> > > > >>>>>>
> > >> > > > >>>>>> Robin Anil | Software Engineer | +1 312 869 2602 |
> Google
> > >> Inc.
> > >> > > > >>>>>>
> > >> > > > >>>>>>
> > >> > > > >>>>>> On Sun, Apr 14, 2013 at 11:56 PM, Ted Dunning <
> > >> > > > ted.dunning@gmail.com>wrote:
> > >> > > > >>>>>>
> > >> > > > >>>>>>> Robin,
> > >> > > > >>>>>>>
> > >> > > > >>>>>>> Can you make sure that the patches are somewhere that
> Dan
> > >> can
> > >> > > pick
> > >> > > > >>>>>>> up this
> > >> > > > >>>>>>> work?  He is in GMT+2 and is probably about to appear on
> > the
> > >> > > scene.
> > >> > > > >>>>>>>
> > >> > > > >>>>>>>
> > >> > > > >>>>>>>
> > >> > > > >>>>>>> On Sun, Apr 14, 2013 at 9:34 PM, Robin Anil <
> > >> > > robin.anil@gmail.com>
> > >> > > > >>>>>>> wrote:
> > >> > > > >>>>>>>
> > >> > > > >>>>>>> > Strike that there are still failures. Investigating.
> if
> > I
> > >> > cant
> > >> > > > fix
> > >> > > > >>>>>>> it in
> > >> > > > >>>>>>> > the next hour, I will submit them sometime in the
> > evening
> > >> > > > tomorrow.
> > >> > > > >>>>>>> >
> > >> > > > >>>>>>> > Robin Anil | Software Engineer | +1 312 869 2602 |
> > Google
> > >> > Inc.
> > >> > > > >>>>>>> >
> > >> > > > >>>>>>> >
> > >> > > > >>>>>>> > On Sun, Apr 14, 2013 at 11:33 PM, Robin Anil <
> > >> > > > robin.anil@gmail.com>
> > >> > > > >>>>>>> wrote:
> > >> > > > >>>>>>> >
> > >> > > > >>>>>>> > > Tests pass. Submitting the patches.
> > >> > > > >>>>>>> > >
> > >> > > > >>>>>>> > > Robin Anil | Software Engineer | +1 312 869 2602 |
> > >> Google
> > >> > > Inc.
> > >> > > > >>>>>>> > >
> > >> > > > >>>>>>> > >
> > >> > > > >>>>>>> > > On Sun, Apr 14, 2013 at 11:17 PM, Robin Anil <
> > >> > > > >>>>>>> robin.anil@gmail.com>
> > >> > > > >>>>>>> > wrote:
> > >> > > > >>>>>>> > >
> > >> > > > >>>>>>> > >> Added a few more tests. Throw
> NoSuchElementException
> > >> like
> > >> > > Java
> > >> > > > >>>>>>> > >> Collections when iterating past the end. Things
> look
> > >> > solid,
> > >> > > > >>>>>>> performance
> > >> > > > >>>>>>> > is
> > >> > > > >>>>>>> > >> 2x. All Math tests pass. I am now waiting for the
> > >> entire
> > >> > > test
> > >> > > > >>>>>>> suites to
> > >> > > > >>>>>>> > run
> > >> > > > >>>>>>> > >> before submitting.
> > >> > > > >>>>>>> > >>
> > >> > > > >>>>>>> > >> Robin Anil | Software Engineer | +1 312 869 2602 |
> > >> Google
> > >> > > > Inc.
> > >> > > > >>>>>>> > >>
> > >> > > > >>>>>>> > >>
> > >> > > > >>>>>>> > >> On Sun, Apr 14, 2013 at 9:49 PM, Robin Anil <
> > >> > > > >>>>>>> robin.anil@gmail.com>
> > >> > > > >>>>>>> > wrote:
> > >> > > > >>>>>>> > >>
> > >> > > > >>>>>>> > >>> I am not sure what I did. But removing Guava
> > Abstract
> > >> > > > iterator
> > >> > > > >>>>>>> actually
> > >> > > > >>>>>>> > >>> sped up the dot, cosine, euclidean by another 60%.
> > >> Things
> > >> > > are
> > >> > > > >>>>>>> now 2x
> > >> > > > >>>>>>> > faster
> > >> > > > >>>>>>> > >>> than trunk. While also correcting the behavior (I
> > >> hope)
> > >> > > > >>>>>>> > >>>
> > >> > > > >>>>>>> > >>>
> > >> > > > >>>>>>> > >>>
> > >> > > > >>>>>>> >
> > >> > > > >>>>>>>
> > >> > > >
> > >> > >
> > >> >
> > >>
> >
> https://docs.google.com/spreadsheet/ccc?key=0AhewTD_ZgznddGFQbWJCQTZXSnFULUYzdURfWDRJQlE#gid=1
> > >> > > > >>>>>>> > >>>
> > >> > > > >>>>>>> > >>> Robin Anil | Software Engineer | +1 312 869 2602|
> > >> > Google
> > >> > > > Inc.
> > >> > > > >>>>>>> > >>>
> > >> > > > >>>>>>> > >>>
> > >> > > > >>>>>>> > >>> On Sun, Apr 14, 2013 at 8:56 PM, Robin Anil <
> > >> > > > >>>>>>> robin.anil@gmail.com
> > >> > > > >>>>>>> > >wrote:
> > >> > > > >>>>>>> > >>>
> > >> > > > >>>>>>> > >>>> Also note that this is code gen, I have to create
> > >> > > > >>>>>>> > Element$keyType$Value
> > >> > > > >>>>>>> > >>>> for each and every combination not just int
> double.
> > >> and
> > >> > > also
> > >> > > > >>>>>>> update
> > >> > > > >>>>>>> > all
> > >> > > > >>>>>>> > >>>> callers to user ElementIntDouble instead of
> > Element.
> > >> Is
> > >> > it
> > >> > > > >>>>>>> worth it ?
> > >> > > > >>>>>>> > >>>>
> > >> > > > >>>>>>> > >>>> Robin Anil | Software Engineer | +1 312 869 2602|
> > >> > Google
> > >> > > > >>>>>>> Inc.
> > >> > > > >>>>>>> > >>>>
> > >> > > > >>>>>>> > >>>>
> > >> > > > >>>>>>> > >>>> On Sun, Apr 14, 2013 at 8:46 PM, Ted Dunning <
> > >> > > > >>>>>>> ted.dunning@gmail.com
> > >> > > > >>>>>>> > >wrote:
> > >> > > > >>>>>>> > >>>>
> > >> > > > >>>>>>> > >>>>> Collections (no longer colt collections) are now
> > >> part
> > >> > of
> > >> > > > >>>>>>> mahout math.
> > >> > > > >>>>>>> > >>>>>  No
> > >> > > > >>>>>>> > >>>>> need to keep them separate.  The lower iterator
> > can
> > >> > > > reference
> > >> > > > >>>>>>> > >>>>> Vector.Element
> > >> > > > >>>>>>> > >>>>>
> > >> > > > >>>>>>> > >>>>>
> > >> > > > >>>>>>> > >>>>> On Sun, Apr 14, 2013 at 6:24 PM, Robin Anil <
> > >> > > > >>>>>>> robin.anil@gmail.com>
> > >> > > > >>>>>>> > >>>>> wrote:
> > >> > > > >>>>>>> > >>>>>
> > >> > > > >>>>>>> > >>>>> > I would have loved to but Element is a sub
> > >> interface
> > >> > in
> > >> > > > >>>>>>> Vector. If
> > >> > > > >>>>>>> > >>>>> we want
> > >> > > > >>>>>>> > >>>>> > to keep colt collections separate we have to
> > keep
> > >> > this
> > >> > > > >>>>>>> separation.
> > >> > > > >>>>>>> > >>>>> >
> > >> > > > >>>>>>> > >>>>>
> > >> > > > >>>>>>> > >>>>
> > >> > > > >>>>>>> > >>>>
> > >> > > > >>>>>>> > >>>
> > >> > > > >>>>>>> > >>
> > >> > > > >>>>>>> > >
> > >> > > > >>>>>>> >
> > >> > > > >>>>>>>
> > >> > > > >>>>>>
> > >> > > > >>>>>>
> > >> > > > >>>>>
> > >> > > > >>>>
> > >> > > > >>>
> > >> > > > >>
> > >> > > > >
> > >> > > >
> > >> > >
> > >> > >
> > >> > >
> > >> > > --
> > >> > >
> > >> > >   -jake
> > >> > >
> > >> >
> > >>
> > >>
> > >>
> > >> --
> > >>
> > >>   -jake
> > >>
> > >
> > >
> >
>
>
>
> --
>
>   -jake
>

Re: Odd vector iteration behavior

Posted by Jake Mannix <ja...@gmail.com>.
On Mon, Apr 15, 2013 at 12:14 PM, Robin Anil <ro...@gmail.com> wrote:

> Another crazy idea for the future is to kill the usage of
> OpenIntDoubleHashMap entirely and copy parts of it inside RASV which will
> only deal with nonzero keys and non zero values. RASV can then keep track
> of non-zero elements in a variable to speed up those lookups.
>
>
> Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
>
>
> On Mon, Apr 15, 2013 at 2:11 PM, Robin Anil <ro...@gmail.com> wrote:
>
> > Point 3 comes from the philosophy that all Vectors behave the same
> > way, so numNonDefaultElements of a DenseVector is the same as that of a
> > SparseVector. E.g., if PersonSimilarity relies upon it for document length,
> > it should behave the same way.
> >
> > Point 4 can be solved by killing the iterator interface entirely and
> > creating a forEachNonZero(function()) method which will only call the
> > function if the element is nonzero.
>

Killing iteration would be really, really bad from a usability standpoint.
In fact, I've been moving in the other direction: https://reviews.apache.org/r/9867/
adds iterators to the basic collection interface!
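
One way to see the usability concern is a lock-step walk over two sparse
vectors, as in a sparse dot product. The caller decides which side to
advance, which a callback-only forEachNonZero cannot express directly. The
following is a rough sketch under stated assumptions: it uses
java.util.TreeMap as a stand-in for an index-ordered sparse vector, and the
class and method names are hypothetical, not Mahout code.

```java
import java.util.Iterator;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; TreeMap stands in for a sorted sparse vector.
public class SparseDotSketch {

  /** Dot product of two index-to-value maps, advancing two iterators in lock step. */
  public static double dot(TreeMap<Integer, Double> a, TreeMap<Integer, Double> b) {
    Iterator<Map.Entry<Integer, Double>> ia = a.entrySet().iterator();
    Iterator<Map.Entry<Integer, Double>> ib = b.entrySet().iterator();
    if (!ia.hasNext() || !ib.hasNext()) {
      return 0.0;
    }
    Map.Entry<Integer, Double> ea = ia.next();
    Map.Entry<Integer, Double> eb = ib.next();
    double sum = 0.0;
    while (true) {
      int cmp = Integer.compare(ea.getKey(), eb.getKey());
      if (cmp == 0) {
        // Same index on both sides: accumulate, then advance both.
        sum += ea.getValue() * eb.getValue();
        if (!ia.hasNext() || !ib.hasNext()) {
          break;
        }
        ea = ia.next();
        eb = ib.next();
      } else if (cmp < 0) {
        // Left index is behind: advance only the left iterator.
        if (!ia.hasNext()) {
          break;
        }
        ea = ia.next();
      } else {
        // Right index is behind: advance only the right iterator.
        if (!ib.hasNext()) {
          break;
        }
        eb = ib.next();
      }
    }
    return sum;
  }

  public static void main(String[] args) {
    TreeMap<Integer, Double> a = new TreeMap<Integer, Double>();
    a.put(0, 1.0);
    a.put(2, 2.0);
    a.put(4, 3.0);
    TreeMap<Integer, Double> b = new TreeMap<Integer, Double>();
    b.put(2, 5.0);
    b.put(3, 7.0);
    b.put(4, 1.0);
    System.out.println(dot(a, b)); // 2*5 + 3*1 = 13.0
  }
}
```

Note that next() is called on each side only when that side actually has to
move, and the elements it returns are not reused by the iterator, so holding
a reference across loop turns is safe in this sketch.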



>  >
> >
> >
> > On Mon, Apr 15, 2013 at 2:08 PM, Jake Mannix <jake.mannix@gmail.com
> >wrote:
> >
> >> On Mon, Apr 15, 2013 at 11:58 AM, Robin Anil <ro...@gmail.com>
> >> wrote:
> >>
> >> > This is what I propose:
> >> >
> >> > 1) Allow setting value to zero while iterating (e.set(0.0)).
> >> >
> >>
> >> This is in addition to the fact that we already allow setting nonzero
> >> values
> >> while iterating, right?
> >>
> >>
> >> > 2) Do not allow callers to use vector.set(index, 0.0) while iterating.
> >> > This can cause re-hashing. (We can set a dirty bit in the hashmap during
> >> > rehash to throw a ConcurrentModificationException.)
> >> >
> >>
> >> Agreed - this is a commonly accepted requirement: I think in fact we
> >> should pro-actively throw ConcurrentModificationException if someone
> >> tries to call vector.set / vector.assign while iterating.
> >>
> >>
> >> > 3) Update the numNonDefaultElements to iterate over the array to
> >> discount
> >> > 0.0 instead of returning the hashMap values.
> >> > 4) IterateNonZero may iterate over a few zeros if you did set the
> >> dimension
> >> > to 0. Most of the statistics code should handle 0 values correctly.
> >> >
> >>
> >> Yeah, are we really strict about getNumNonDefaultElements always
> >> returning exactly the number of nonzeroes?  I was under the impression that
> >> for e.g. DenseVector, it would give the overall size, even if some were 0,
> >> and that it was basically tracking the amount of space the vector was taking
> >> up.  But I can see the argument that it really should return what it says it
> >> returns, if that is relied upon.
> >>
> >>
> >> >
> >> >
> >> >
> >> > Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
> >> >
> >> >
> >> > On Mon, Apr 15, 2013 at 1:50 PM, Jake Mannix <ja...@gmail.com>
> >> > wrote:
> >> >
> >> > > Ah, this was the one corner case I was worried about - we do
> >> special-case
> >> > > setting to 0,
> >> > > as meaning remove from the hashmap, yes.
> >> > >
> >> > > What's the TL;DR of what you did to work around this?  Should we
> allow
> >> > > this?  Even
> >> > > if it's through the Vector.Element instance, should it be ok?  If
> so,
> >> how
> >> > > to handle?
> >> > >
> >> > >
> >> > > On Mon, Apr 15, 2013 at 11:04 AM, Robin Anil <ro...@gmail.com>
> >> > wrote:
> >> > >
> >> > > > I am adding the tests and updating the patch.
> >> > > >
> >> > > > Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
> >> > > >
> >> > > >
> >> > > > On Mon, Apr 15, 2013 at 1:03 PM, Robin Anil <robin.anil@gmail.com
> >
> >> > > wrote:
> >> > > >
> >> > > > > You can re-iterate if the state is in iteration. But you cannot
> >> > write.
> >> > > > >
> >> > > > > This is what is happening:
> >> > > > >
> >> > > > > One of the values is becoming 0, so Vector tries to remove it from the
> >> > > > > underlying hashmap. This changes the layout. If a vector has to be mutated
> >> > > > > while iterating, we have to set a 0 value in the hashmap and not remove it
> >> > > > > as the Vector layer does now. This adds another complexity: the
> >> > > > > vector iterator has to deal with skipping over elements with a 0 value.
> >> > > > >
> >> > > > >
> >> > > > > Try this
> >> > > > >
> >> > > > > Create a vector of length 13 and set the following values.
> >> > > > >
> >> > > > >
> >> > > > >     double[] val = new double[] { 0, 2, 0, 0, 8, 3, 0, 6, 0, 1, 1, 2, 1 };
> >> > > > >     for (int i = 0; i < val.length; ++i) {
> >> > > > >       vector.set(i, val[i]);
> >> > > > >     }
> >> > > > >
> >> > > > > Iterate again and while iterating set one of the values as zero.
> >> > > > >
> >> > > > > On Mon, Apr 15, 2013 at 12:56 PM, Dan Filimon <
> >> > > > dangeorge.filimon@gmail.com
> >> > > > > > wrote:
> >> > > > >
> >> > > > >> What kind of Vector is failing to set() in that code?
> >> > > > >>
> >> > > > >> About the state enum, what if (for whatever reason, not
> >> > > > >> multi-threaded-ness) there are multiple iterators to that
> vector?
> >> > > > >> Something like a reference count (how many iterators point to
> it)
> >> > > would
> >> > > > >> probably be needed, and keeping it sane would only be possible
> in
> >> > one
> >> > > > >> thread. Although this seems kind of brittle.
> >> > > > >>
> >> > > > >> +1 for numNonDefault.
> >> > > > >>
> >> > > > >>
> >> > > > >> On Mon, Apr 15, 2013 at 8:36 PM, Robin Anil <
> >> robin.anil@gmail.com>
> >> > > > wrote:
> >> > > > >>
> >> > > > >>> Another behavior difference.
> >> > > > >>>
> >> > > > >>> The numNonDefaultElement for a DenseVector returns the total length.
> >> > > > >>> This causes Pearson Correlation Similarity to differ from what it would
> >> > > > >>> be if it were implemented using one of the SparseVectors.
> >> > > > >>> I am proposing to fix numNonDefaultElement to correctly iterate over
> >> > > > >>> the dense vector to count the non-zero values. Sounds ok?
> >> > > > >>>
> >> > > > >>> Robin Anil | Software Engineer | +1 312 869 2602 | Google
> Inc.
> >> > > > >>>
> >> > > > >>>
> >> > > > >>> On Mon, Apr 15, 2013 at 12:32 PM, Robin Anil <
> >> robin.anil@gmail.com
> >> > > > >wrote:
> >> > > > >>>
> >> > > > >>>> Found the bug PearsonCorrelationSimilarity was trying to
> mutate
> >> > the
> >> > > > >>>> object while iterating.
> >> > > > >>>>
> >> > > > >>>>
> >> > > > >>>>     while (it.hasNext()) {
> >> > > > >>>>       Vector.Element e = it.next();
> >> > > > >>>>       vector.set(e.index(), e.get() - average);
> >> > > > >>>>     }
> >> > > > >>>>
> >> > > > >>>> This has a side effect of causing the underlying hash-map or
> >> > object
> >> > > to
> >> > > > >>>> change.
> >> > > > >>>>
> >> > > > >>>> The right behavior is to set the value of the index while
> >> > iterating.
> >> > > > >>>>
> >> > > > >>>>     while (it.hasNext()) {
> >> > > > >>>>       Vector.Element e = it.next();
> >> > > > >>>>       e.set(e.get() - average);
> >> > > > >>>>     }
> >> > > > >>>>
> >> > > > >>>> I am sure we are incorrectly doing the first style across the
> >> code
> >> > > at
> >> > > > >>>> many places.
> >> > > > >>>>
> >> > > > >>>> I am proposing this
> >> > > > >>>>
> >> > > > >>>> When iterating, we lock the set interface on the vector
> using a
> >> > > State
> >> > > > >>>> enum. If anyone tries to mutate, we throw an exception.
> >> > > > >>>> We flip the state when we complete iterating (hasNext =
> false)
> >> or
> >> > > when
> >> > > > >>>> we explicitly close the iterator (adding a close method on
> the
> >> > > > iterator).
> >> > > > >>>>
> >> > > > >>>> Again this is all a single thread fix. if a vector is being
> >> > mutated
> >> > > > and
> >> > > > >>>> iterated across multiple threads, all hell can break loose.
> >> > > > >>>>
> >> > > > >>>> Robin
> >> > > > >>>>
> >> > > > >>>>
> >> > > > >>>>
> >> > > > >>>> On Mon, Apr 15, 2013 at 12:56 AM, Robin Anil <
> >> > robin.anil@gmail.com
> >> > > > >wrote:
> >> > > > >>>>
> >> > > > >>>>> Spoke too soon still failure.  I am uploading the latest
> >> patch.
> >> > > These
> >> > > > >>>>> are the current failing tests.
> >> > > > >>>>>
> >> > > > >>>>>
> >> > > >
> >> > >
> >> >
> >>
>  ClusterClassificationDriverTest.testVectorClassificationWithOutlierRemovalMR:103->assertVectorsWithOutlierRemoval:189->checkClustersWithOutlierRemoval:239->Assert.assertTrue:41->Assert.fail:88
> >> > > > >>>>> not expecting cluster:{0:1.0,1:1.0}
> >> > > > >>>>>
> >> > > > >>>>>
> >> > > >
> >> > >
> >> >
> >>
> ClusterClassificationDriverTest.testVectorClassificationWithOutlierRemoval:139->assertVectorsWithOutlierRemoval:189->checkClustersWithOutlierRemoval:239->Assert.assertTrue:41->Assert.fail:88
> >> > > > >>>>> not expecting cluster:{0:1.0,1:1.0}
> >> > > > >>>>>
> >> > > > >>>>>
> >> > > >
> >> > >
> >> >
> >>
> ClusterClassificationDriverTest.testVectorClassificationWithoutOutlierRemoval:121->assertVectorsWithoutOutlierRemoval:193->assertFirstClusterWithoutOutlierRemoval:218->Assert.assertTrue:52->Assert.assertTrue:41->Assert.fail:86
> >> > > > >>>>> null
> >> > > > >>>>>
> >> > > > >>>>>
> >> > > >
> >> > >
> >> >
> >>
> ClusterOutputPostProcessorTest.testTopDownClustering:102->assertPostProcessedOutput:188->assertTopLevelCluster:115->assertPointsInSecondTopLevelCluster:134->Assert.assertTrue:52->Assert.assertTrue:41->Assert.fail:86
> >> > > > >>>>> null
> >> > > > >>>>>
> >> > > > >>>>>
> >> > > >
> >> > >
> >> >
> >>
> VectorSimilarityMeasuresTest.testPearsonCorrelationSimilarity:109->Assert.assertEquals:592->Assert.assertEquals:494->Assert.failNotEquals:743->Assert.fail:88
> >> > > > >>>>> expected:<0.5303300858899108> but was:<0.38729833462074176>
> >> > > > >>>>>
> >> > > > >>>>>
> >> > > > >>>>> Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
> >> > > > >>>>>
> >> > > > >>>>>
> >> > > > >>>>> On Mon, Apr 15, 2013 at 12:24 AM, Robin Anil <robin.anil@gmail.com> wrote:
> >> > > > >>>>>
> >> > > > >>>>>> Found it, fixed it. I am submitting soon.
> >> > > > >>>>>>
> >> > > > >>>>>> Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
> >> > > > >>>>>>
> >> > > > >>>>>>
> >> > > > >>>>>> On Sun, Apr 14, 2013 at 11:56 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
> >> > > > >>>>>>
> >> > > > >>>>>>> Robin,
> >> > > > >>>>>>>
> >> > > > >>>>>>> Can you make sure that the patches are somewhere that Dan can pick up this
> >> > > > >>>>>>> work?  He is in GMT+2 and is probably about to appear on the scene.
> >> > > > >>>>>>>
> >> > > > >>>>>>>
> >> > > > >>>>>>>
> >> > > > >>>>>>> On Sun, Apr 14, 2013 at 9:34 PM, Robin Anil <robin.anil@gmail.com> wrote:
> >> > > > >>>>>>>
> >> > > > >>>>>>> > Strike that, there are still failures. Investigating. If I can't fix it in
> >> > > > >>>>>>> > the next hour, I will submit them sometime in the evening tomorrow.
> >> > > > >>>>>>> >
> >> > > > >>>>>>> > Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
> >> > > > >>>>>>> >
> >> > > > >>>>>>> >
> >> > > > >>>>>>> > On Sun, Apr 14, 2013 at 11:33 PM, Robin Anil <robin.anil@gmail.com> wrote:
> >> > > > >>>>>>> >
> >> > > > >>>>>>> > > Tests pass. Submitting the patches.
> >> > > > >>>>>>> > >
> >> > > > >>>>>>> > > Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
> >> > > > >>>>>>> > >
> >> > > > >>>>>>> > >
> >> > > > >>>>>>> > > On Sun, Apr 14, 2013 at 11:17 PM, Robin Anil <robin.anil@gmail.com> wrote:
> >> > > > >>>>>>> > >
> >> > > > >>>>>>> > >> Added a few more tests. Throw NoSuchElementException like Java
> >> > > > >>>>>>> > >> Collections when iterating past the end. Things look solid, performance is
> >> > > > >>>>>>> > >> 2x. All Math tests pass. I am now waiting for the entire test suites to run
> >> > > > >>>>>>> > >> before submitting.
> >> > > > >>>>>>> > >>
> >> > > > >>>>>>> > >> Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
> >> > > > >>>>>>> > >>
> >> > > > >>>>>>> > >>
> >> > > > >>>>>>> > >> On Sun, Apr 14, 2013 at 9:49 PM, Robin Anil <robin.anil@gmail.com> wrote:
> >> > > > >>>>>>> > >>
> >> > > > >>>>>>> > >>> I am not sure what I did, but removing Guava's AbstractIterator actually
> >> > > > >>>>>>> > >>> sped up dot, cosine and euclidean by another 60%. Things are now 2x faster
> >> > > > >>>>>>> > >>> than trunk, while also correcting the behavior (I hope).
> >> > > > >>>>>>> > >>>
> >> > > > >>>>>>> > >>> https://docs.google.com/spreadsheet/ccc?key=0AhewTD_ZgznddGFQbWJCQTZXSnFULUYzdURfWDRJQlE#gid=1
> >> > > > >>>>>>> > >>>
> >> > > > >>>>>>> > >>> Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
> >> > > > >>>>>>> > >>>
> >> > > > >>>>>>> > >>>
> >> > > > >>>>>>> > >>> On Sun, Apr 14, 2013 at 8:56 PM, Robin Anil <robin.anil@gmail.com> wrote:
> >> > > > >>>>>>> > >>>
> >> > > > >>>>>>> > >>>> Also note that this is code gen: I have to create Element$keyType$Value
> >> > > > >>>>>>> > >>>> for each and every combination, not just int/double, and also update all
> >> > > > >>>>>>> > >>>> callers to use ElementIntDouble instead of Element. Is it worth it?
> >> > > > >>>>>>> > >>>>
> >> > > > >>>>>>> > >>>> Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
> >> > > > >>>>>>> > >>>>
> >> > > > >>>>>>> > >>>>
> >> > > > >>>>>>> > >>>> On Sun, Apr 14, 2013 at 8:46 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
> >> > > > >>>>>>> > >>>>
> >> > > > >>>>>>> > >>>>> Collections (no longer colt collections) are now part of mahout math.  No
> >> > > > >>>>>>> > >>>>> need to keep them separate.  The lower iterator can reference
> >> > > > >>>>>>> > >>>>> Vector.Element.
> >> > > > >>>>>>> > >>>>>
> >> > > > >>>>>>> > >>>>>
> >> > > > >>>>>>> > >>>>> On Sun, Apr 14, 2013 at 6:24 PM, Robin Anil <robin.anil@gmail.com> wrote:
> >> > > > >>>>>>> > >>>>>
> >> > > > >>>>>>> > >>>>> > I would have loved to, but Element is a sub-interface in Vector. If we want
> >> > > > >>>>>>> > >>>>> > to keep colt collections separate we have to keep this separation.
> >> > > > >>>>>>> > >>>>> >
> >> > > > >>>>>>> > >>>>>
> >> > > > >>>>>>> > >>>>
> >> > > > >>>>>>> > >>>>
> >> > > > >>>>>>> > >>>
> >> > > > >>>>>>> > >>
> >> > > > >>>>>>> > >
> >> > > > >>>>>>> >
> >> > > > >>>>>>>
> >> > > > >>>>>>
> >> > > > >>>>>>
> >> > > > >>>>>
> >> > > > >>>>
> >> > > > >>>
> >> > > > >>
> >> > > > >
> >> > > >
> >> > >
> >> > >
> >> > >
> >> > > --
> >> > >
> >> > >   -jake
> >> > >
> >> >
> >>
> >>
> >>
> >> --
> >>
> >>   -jake
> >>
> >
> >
>



-- 

  -jake

Re: Odd vector iteration behavior

Posted by Robin Anil <ro...@gmail.com>.
Another crazy idea for the future is to kill the usage of
OpenIntDoubleHashMap entirely and copy parts of it into RASV
(RandomAccessSparseVector), which would then only deal with nonzero keys and
nonzero values. RASV could then keep track of the number of non-zero
elements in a variable to speed up those lookups.
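
A minimal sketch of the "keep the count in a variable" part of that idea, under stated assumptions: CountingSparseVector is a hypothetical class made up for illustration, and java.util.HashMap stands in for the specialised int-to-double hash map. It is not Mahout code and says nothing about how RASV is actually implemented.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a random-access sparse vector; NOT Mahout's RASV.
// java.util.HashMap is used only to keep the sketch self-contained.
class CountingSparseVector {
  private final Map<Integer, Double> values = new HashMap<>();
  private int nonZeroCount = 0;  // updated on every write, so reads are O(1)

  void set(int index, double value) {
    Double old = values.get(index);
    boolean wasNonZero = old != null && old != 0.0;
    boolean isNonZero = value != 0.0;
    if (isNonZero) {
      values.put(index, value);
    } else {
      values.remove(index);  // only nonzero values are ever stored
    }
    if (wasNonZero && !isNonZero) {
      nonZeroCount--;
    } else if (!wasNonZero && isNonZero) {
      nonZeroCount++;
    }
  }

  double get(int index) {
    Double v = values.get(index);
    return v == null ? 0.0 : v;
  }

  // No scan needed: the count is maintained incrementally by set().
  int getNumNonZeroElements() {
    return nonZeroCount;
  }
}
```

The trade-off in this sketch is a little extra bookkeeping on every write in exchange for a constant-time count.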


Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.


On Mon, Apr 15, 2013 at 2:11 PM, Robin Anil <ro...@gmail.com> wrote:

> Point 3 comes from the philosophy that all Vectors should behave the same
> way: numNonDefaultElements of a DenseVector should be the same as that of a
> SparseVector holding the same values. E.g., if PearsonCorrelationSimilarity
> relies upon it for document length, it should behave the same way for both.
>
> Point 4 can be solved by killing the iterator interface entirely and
> creating a forEachNonZero(function) method which only calls the function
> if the element is nonzero.
>
>
>
> On Mon, Apr 15, 2013 at 2:08 PM, Jake Mannix <ja...@gmail.com> wrote:
>
>> On Mon, Apr 15, 2013 at 11:58 AM, Robin Anil <ro...@gmail.com> wrote:
>>
>> > This is what I propose:
>> >
>> > 1) Allow setting a value to zero while iterating (e.set(0.0)).
>> >
>>
>> This is in addition to the fact that we already allow setting nonzero
>> values while iterating, right?
>>
>>
>> > 2) Do not allow callers to use vector.set(index, 0.0) while iterating.
>> > This can cause re-hashing. (We can set a dirty bit in the hashmap during
>> > rehash to throw a ConcurrentModificationException.)
>> >
>>
>> Agreed - this is a commonly accepted requirement: I think in fact we
>> should pro-actively throw ConcurrentModificationException if someone
>> tries to call vector.set / vector.assign while iterating.
>>
>>
>> > 3) Update numNonDefaultElements to iterate over the array and discount
>> > 0.0 values instead of returning the number of values in the hashMap.
>> > 4) iterateNonZero may iterate over a few zeros if you set a dimension
>> > to 0. Most of the statistics code should handle 0 values correctly.
>> >
>>
>> Yeah, are we really strict about getNumNonDefaultElements always
>> returning exactly the number of nonzeroes?  I was under the impression
>> that for e.g. DenseVector, it would give the overall size, even if some
>> were 0, and that it was basically tracking the amount of space the vector
>> was taking up.  But I can see the argument that it really should return
>> what it says it returns, if that is relied upon.
>>
>>
>> >
>> >
>> >
>> > Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
>> >
>> >
>> > On Mon, Apr 15, 2013 at 1:50 PM, Jake Mannix <ja...@gmail.com> wrote:
>> >
>> > > Ah, this was the one corner case I was worried about - we do special-case
>> > > setting to 0 as meaning remove from the hashmap, yes.
>> > >
>> > > What's the TL;DR of what you did to work around this?  Should we allow
>> > > this?  Even if it's through the Vector.Element instance, should it be OK?
>> > > If so, how to handle?
>> > >
>> > >
>> > > On Mon, Apr 15, 2013 at 11:04 AM, Robin Anil <ro...@gmail.com> wrote:
>> > >
>> > > > I am adding the tests and updating the patch.
>> > > >
>> > > > Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
>> > > >
>> > > >
>> > > > On Mon, Apr 15, 2013 at 1:03 PM, Robin Anil <ro...@gmail.com> wrote:
>> > > >
>> > > > > You can re-iterate if the state is in iteration. But you cannot write.
>> > > > >
>> > > > > This is what is happening:
>> > > > >
>> > > > > One of the values is becoming 0, so Vector tries to remove it from the
>> > > > > underlying hashmap. This changes the layout. If a vector has to be
>> > > > > mutated while iterating, we have to set a 0 value in the hashmap and
>> > > > > not remove it like the Vector layer is doing now. This adds another
>> > > > > complexity: the vector iterator has to deal with skipping over elements
>> > > > > with 0 value.
>> > > > >
>> > > > >
>> > > > > Try this:
>> > > > >
>> > > > > Create a vector of length 13 and set the following values.
>> > > > >
>> > > > >
>> > > > >     double[] val = new double[] { 0, 2, 0, 0, 8, 3, 0, 6, 0, 1, 1, 2, 1 };
>> > > > >     for (int i = 0; i < val.length; ++i) {
>> > > > >       vector.set(i, val[i]);
>> > > > >     }
>> > > > >
>> > > > > Then iterate again and, while iterating, set one of the values to zero.
>> > > > >
>> > > > > On Mon, Apr 15, 2013 at 12:56 PM, Dan Filimon <dangeorge.filimon@gmail.com> wrote:
>> > > > >
>> > > > >> What kind of Vector is failing to set() in that code?
>> > > > >>
>> > > > >> About the state enum, what if (for whatever reason, not
>> > > > >> multi-threaded-ness) there are multiple iterators to that vector?
>> > > > >> Something like a reference count (how many iterators point to it)
>> > > would
>> > > > >> probably be needed, and keeping it sane would only be possible in
>> > one
>> > > > >> thread. Although this seems kind of brittle.
>> > > > >>
>> > > > >> +1 for numNonDefault.
>> > > > >>
>> > > > >>
>> > > > >> On Mon, Apr 15, 2013 at 8:36 PM, Robin Anil <robin.anil@gmail.com> wrote:
>> > > > >>
>> > > > >>> Another behavior difference.
>> > > > >>>
>> > > > >>> numNonDefaultElements for a DenseVector returns the total length.
>> > > > >>> This causes Pearson Correlation Similarity to give a different result
>> > > > >>> than if it were computed using one of the SparseVectors.
>> > > > >>> I am proposing to fix numNonDefaultElements to iterate over the dense
>> > > > >>> vector and count the nonzero values. Sound OK?
>> > > > >>>
>> > > > >>> Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
>> > > > >>>
>> > > > >>>
>> > > > >>> On Mon, Apr 15, 2013 at 12:32 PM, Robin Anil <robin.anil@gmail.com> wrote:
>> > > > >>>
>> > > > >>>> Found the bug: PearsonCorrelationSimilarity was trying to mutate the
>> > > > >>>> object while iterating.
>> > > > >>>>
>> > > > >>>>
>> > > > >>>>     while (it.hasNext()) {
>> > > > >>>>       Vector.Element e = it.next();
>> > > > >>>>       vector.set(e.index(), e.get() - average);  // writes through the vector
>> > > > >>>>     }
>> > > > >>>>
>> > > > >>>> This has the side effect of causing the underlying hash-map or object to
>> > > > >>>> change.
>> > > > >>>>
>> > > > >>>> The right behavior is to set the value through the element while
>> > > > >>>> iterating.
>> > > > >>>>
>> > > > >>>>     while (it.hasNext()) {
>> > > > >>>>       Vector.Element e = it.next();
>> > > > >>>>       e.set(e.get() - average);  // writes through the element
>> > > > >>>>     }
>> > > > >>>>
>> > > > >>>> I am sure we are incorrectly using the first style in many places
>> > > > >>>> across the code.
>> > > > >>>>
>> > > > >>>> I am proposing this:
>> > > > >>>>
>> > > > >>>> When iterating, we lock the set interface on the vector using a State
>> > > > >>>> enum. If anyone tries to mutate, we throw an exception.
>> > > > >>>> We flip the state when we complete iterating (hasNext() returns false)
>> > > > >>>> or when we explicitly close the iterator (adding a close method on the
>> > > > >>>> iterator).
>> > > > >>>>
>> > > > >>>> Again, this is all a single-thread fix. If a vector is being mutated and
>> > > > >>>> iterated across multiple threads, all hell can break loose.
>> > > > >>>>
>> > > > >>>> Robin

Re: Odd vector iteration behavior

Posted by Robin Anil <ro...@gmail.com>.
Point 3 comes from the philosophy that all Vectors should behave the same
way: numNonDefaultElements of a DenseVector should be the same as that of a
SparseVector holding the same values. E.g., if PearsonCorrelationSimilarity
relies upon it for document length, it should behave the same way for both.
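
To make point 3 concrete, here is a rough sketch of counting nonzeros the same way for dense and sparse storage. NonZeroCounts and its two methods are hypothetical names invented for this illustration (they are not Mahout API), and a plain java.util.Map stands in for the sparse backing store. The sample values are the ones from the 13-element example earlier in the thread.

```java
import java.util.Map;

// Hypothetical helper for illustration only; not part of Mahout's Vector API.
final class NonZeroCounts {
  private NonZeroCounts() {}

  // Dense storage: scan the backing array and skip explicit zeros,
  // instead of reporting the array length.
  static int ofDense(double[] values) {
    int count = 0;
    for (double v : values) {
      if (v != 0.0) {
        count++;
      }
    }
    return count;
  }

  // Sparse storage: a stored entry may still hold 0.0 (for example after
  // e.set(0.0) during iteration), so count the values rather than the map size.
  static int ofSparse(Map<Integer, Double> entries) {
    int count = 0;
    for (double v : entries.values()) {
      if (v != 0.0) {
        count++;
      }
    }
    return count;
  }
}
```

With the values { 0, 2, 0, 0, 8, 3, 0, 6, 0, 1, 1, 2, 1 }, both methods report 8, whereas a length-based answer for the dense case would be 13.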

Point 4 can be solved by killing the iterator interface entirely and
creating a forEachNonZero(function) method which only calls the function
if the element is nonzero.
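
A rough sketch of what such a callback-style traversal might look like. Everything here is an assumption for illustration: NonZeroVisitor, ArrayBackedVector and the forEachNonZero signature are hypothetical and are not Mahout API.

```java
// Hypothetical callback type; invoked once per nonzero element.
interface NonZeroVisitor {
  void visit(int index, double value);
}

// Hypothetical array-backed vector used only to demonstrate the traversal.
class ArrayBackedVector {
  private final double[] values;

  ArrayBackedVector(double[] values) {
    this.values = values.clone();  // defensive copy
  }

  // The zero check lives inside the vector, so callers never see a stored
  // zero and never hold an iterator that a write could invalidate.
  void forEachNonZero(NonZeroVisitor visitor) {
    for (int i = 0; i < values.length; i++) {
      if (values[i] != 0.0) {
        visitor.visit(i, values[i]);
      }
    }
  }
}
```

A caller would then write something like v.forEachNonZero((i, x) -> sum[0] += x); the skipping of zeros becomes an implementation detail of the vector rather than a burden on every statistics routine.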



On Mon, Apr 15, 2013 at 2:08 PM, Jake Mannix <ja...@gmail.com> wrote:

> On Mon, Apr 15, 2013 at 11:58 AM, Robin Anil <ro...@gmail.com> wrote:
>
> > This is what I propose:
> >
> > 1) Allow setting a value to zero while iterating (e.set(0.0)).
> >
>
> This is in addition to the fact that we already allow setting nonzero
> values while iterating, right?
>
>
> > 2) Do not allow callers to use vector.set(index, 0.0) while iterating.
> > This can cause re-hashing. (We can set a dirty bit in the hashmap during
> > rehash to throw a ConcurrentModificationException.)
> >
>
> Agreed - this is a commonly accepted requirement: I think in fact we
> should pro-actively throw ConcurrentModificationException if someone
> tries to call vector.set / vector.assign while iterating.
>
>
> > 3) Update numNonDefaultElements to iterate over the array and discount
> > 0.0 values instead of returning the number of values in the hashMap.
> > 4) iterateNonZero may iterate over a few zeros if you set a dimension
> > to 0. Most of the statistics code should handle 0 values correctly.
> >
>
> Yeah, are we really strict about getNumNonDefaultElements always
> returning exactly the number of nonzeroes?  I was under the impression that
> for e.g. DenseVector, it would give the overall size, even if some were 0,
> and that it was basically tracking the amount of space the vector was
> taking up.  But I can see the argument that it really should return what it
> says it returns, if that is relied upon.
>
>

Re: Odd vector iteration behavior

Posted by Jake Mannix <ja...@gmail.com>.
On Mon, Apr 15, 2013 at 11:58 AM, Robin Anil <ro...@gmail.com> wrote:

> This is what I propose:
>
> 1) Allow setting value to zero while iterating (e.set(0.0)).
>

This is in addition to the fact that we already allow setting nonzero values
while iterating, right?


> 2) Do not allow callers to use vector.set(index, 0.0) while iterating.
> This can cause re-hashing. (We can set a dirty bit in the hashmap during
> rehash and throw a ConcurrentModificationException.)
>

Agreed - this is a commonly accepted requirement. In fact, I think we
should proactively throw ConcurrentModificationException if someone
tries to call vector.set / vector.assign while iterating.
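
A minimal sketch of the fail-fast idea, in the style of the java.util
collections: the vector bumps a modification counter on every direct set(),
and each iterator compares that counter on every call. GuardedVector and its
methods are hypothetical names used only for illustration, not Mahout classes.
Note one difference from the suggestion above: the exception surfaces on the
next iterator call rather than inside set() itself. In-place updates go
through the entry, which stands in for Vector.Element.set here.

```java
import java.util.ConcurrentModificationException;
import java.util.Iterator;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for a sparse vector; not a Mahout class.
final class GuardedVector {
  private final TreeMap<Integer, Double> values = new TreeMap<Integer, Double>();
  private int modCount = 0; // bumped on every direct call to set()

  double get(int index) {
    Double v = values.get(index);
    return v == null ? 0.0 : v;
  }

  // Direct writes may add or remove entries, so they invalidate live iterators.
  void set(int index, double value) {
    modCount++;
    if (value == 0.0) {
      values.remove(index);
    } else {
      values.put(index, value);
    }
  }

  // Entries returned here can be updated in place with setValue(),
  // which never changes the layout and so never bumps modCount.
  Iterator<Map.Entry<Integer, Double>> iterateNonZero() {
    final int expected = modCount;
    final Iterator<Map.Entry<Integer, Double>> inner = values.entrySet().iterator();
    return new Iterator<Map.Entry<Integer, Double>>() {
      private void check() {
        if (modCount != expected) {
          throw new ConcurrentModificationException("vector.set() called while iterating");
        }
      }

      @Override
      public boolean hasNext() {
        check();
        return inner.hasNext();
      }

      @Override
      public Map.Entry<Integer, Double> next() {
        check();
        return inner.next();
      }

      @Override
      public void remove() {
        throw new UnsupportedOperationException();
      }
    };
  }
}
```

With this shape, subtracting an average through the entry is fine, while a
vector.set() in the middle of the loop makes the next hasNext() or next() throw.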


> 3) Update numNonDefaultElements to iterate over the array and discount
> 0.0 values instead of returning the number of hashmap entries.
> 4) iterateNonZero may iterate over a few zeros if you set a dimension
> to 0. Most of the statistics code should handle 0 values correctly.
>

Yeah, are we strict about getNumNonDefaultElements always returning
exactly the number of nonzeroes?  I was under the impression that for
e.g. DenseVector, it would give the overall size, even if some were 0,
and that it was basically tracking the amount of space the vector was taking
up.  But I can see the argument that it really should return what it says it
returns, if that is relied upon.
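
To make the two readings concrete, here is a small sketch over a plain
double[] standing in for a dense vector. NonDefaultCount, storageSize and
nonZeroCount are hypothetical names for illustration only; neither is a
Mahout method.

```java
// Hypothetical helper for illustration; these are not Mahout methods.
final class NonDefaultCount {
  // "Space taken up" reading: every slot of a dense array counts.
  static int storageSize(double[] dense) {
    return dense.length;
  }

  // "What it says" reading: only slots holding a nonzero value count.
  static int nonZeroCount(double[] dense) {
    int count = 0;
    for (double d : dense) {
      if (d != 0.0) {
        count++;
      }
    }
    return count;
  }

  public static void main(String[] args) {
    // Same values as the 13-element test vector used elsewhere in this thread.
    double[] val = { 0, 2, 0, 0, 8, 3, 0, 6, 0, 1, 1, 2, 1 };
    System.out.println(storageSize(val) + " slots, " + nonZeroCount(val) + " nonzero");
  }
}
```

For that vector the two readings give 13 and 8, which is the kind of gap that
would make a similarity measure differ between dense and sparse inputs.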





-- 

  -jake

Re: Odd vector iteration behavior

Posted by Robin Anil <ro...@gmail.com>.
This is what I propose:

1) Allow setting a value to zero while iterating (e.set(0.0)).
2) Do not allow callers to use vector.set(index, 0.0) while iterating.
This can cause re-hashing. (We can set a dirty bit in the hashmap during
rehash and throw a ConcurrentModificationException.)
3) Update numNonDefaultElements to iterate over the array and discount
0.0 values instead of returning the number of hashmap entries.
4) iterateNonZero may iterate over a few zeros if you set a dimension
to 0. Most of the statistics code should handle 0 values correctly.
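
The four points can be sketched together as follows. This is only an
illustration under simplifying assumptions: InPlaceSparseVector is a
hypothetical class backed by two parallel arrays rather than the real hash
map, and it allocates a fresh Element per next() call for simplicity.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical sparse vector backed by parallel arrays; not a Mahout class.
final class InPlaceSparseVector {
  private final int[] indices;
  private final double[] values;

  InPlaceSparseVector(int[] indices, double[] values) {
    this.indices = indices.clone();
    this.values = values.clone();
  }

  // View of one stored slot, playing the role of Vector.Element.
  final class Element {
    private final int slot;

    Element(int slot) {
      this.slot = slot;
    }

    int index() {
      return indices[slot];
    }

    double get() {
      return values[slot];
    }

    // Point 1: a zero is written into the slot; the slot is not removed,
    // so the layout being iterated over never changes.
    void set(double value) {
      values[slot] = value;
    }
  }

  // Point 4: walks every stored slot, so it may return elements whose
  // value was set to zero earlier.
  Iterator<Element> iterateNonZero() {
    return new Iterator<Element>() {
      private int cursor = 0;

      @Override
      public boolean hasNext() {
        return cursor < indices.length;
      }

      @Override
      public Element next() {
        if (!hasNext()) {
          throw new NoSuchElementException();
        }
        return new Element(cursor++);
      }

      @Override
      public void remove() {
        throw new UnsupportedOperationException();
      }
    };
  }

  // Point 3: count actual nonzero values instead of the number of stored slots.
  int getNumNonDefaultElements() {
    int count = 0;
    for (double v : values) {
      if (v != 0.0) {
        count++;
      }
    }
    return count;
  }
}
```

Point 2 is deliberately left out here: this sketch has no vector.set(index,
value) at all, so the only write path during iteration is Element.set.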



Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.



Re: Odd vector iteration behavior

Posted by Jake Mannix <ja...@gmail.com>.
Ah, this was the one corner case I was worried about - we do special-case
setting to 0, as meaning remove from the hashmap, yes.

What's the TL;DR of what you did to work around this?  Should we allow
this?  Even if it's through the Vector.Element instance, should it be ok?
If so, how to handle?


On Mon, Apr 15, 2013 at 11:04 AM, Robin Anil <ro...@gmail.com> wrote:

> I am adding the tests and updating the patch.
>
> Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
>
>
> On Mon, Apr 15, 2013 at 1:03 PM, Robin Anil <ro...@gmail.com> wrote:
>
> > You can re-iterate if the state is in iteration. But you cannot write.
> >
> > This is what is happening:
> >
> > One of the values is becoming 0, so Vector tries to remove it from the
> > underlying hashmap. This changes the layout. If a vector has to be mutated
> > while iterating, we have to set a 0 value in the hashmap rather than remove
> > it, as the Vector layer does now. This adds another complexity: the
> > vector iterator has to deal with skipping over elements with 0 value.
> >
> >
> > Try this
> >
> > Create a vector of length 13 and set the following values.
> >
> >
> >     double[] val = new double[] { 0, 2, 0, 0, 8, 3, 0, 6, 0, 1, 1, 2, 1 };
> >     for (int i = 0; i < val.length; ++i) {
> >       vector.set(i, val[i]);
> >     }
> >
> > Iterate again and, while iterating, set one of the values to zero.
> >
> > On Mon, Apr 15, 2013 at 12:56 PM, Dan Filimon <dangeorge.filimon@gmail.com> wrote:
> >
> >> What kind of Vector is failing to set() in that code?
> >>
> >> About the state enum, what if (for whatever reason, not
> >> multi-threaded-ness) there are multiple iterators to that vector?
> >> Something like a reference count (how many iterators point to it) would
> >> probably be needed, and keeping it sane would only be possible in one
> >> thread. Although this seems kind of brittle.
> >>
> >> +1 for numNonDefault.
> >>
> >>
> >> On Mon, Apr 15, 2013 at 8:36 PM, Robin Anil <ro...@gmail.com> wrote:
> >>
> >>> Another behavior difference.
> >>>
> >>> The numNonDefaultElement for a DenseVector returns the total length.
> >>> This causes Pearson Correlation Similarity to differ from what it would
> >>> be if implemented using one of the SparseVectors.
> >>> I am proposing to fix numNonDefaultElement to iterate over the dense
> >>> vector and count the nonzero values. Sounds ok?
> >>>
> >>> Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
> >>>
> >>>
> >>> On Mon, Apr 15, 2013 at 12:32 PM, Robin Anil <robin.anil@gmail.com> wrote:
> >>>
> >>>> Found the bug: PearsonCorrelationSimilarity was trying to mutate the
> >>>> object while iterating.
> >>>>
> >>>>
> >>>>     while (it.hasNext()) {
> >>>>       Vector.Element e = it.next();
> >>>>       vector.set(e.index(), e.get() - average);
> >>>>     }
> >>>>
> >>>> This has a side effect of causing the underlying hash-map or object to
> >>>> change.
> >>>>
> >>>> The right behavior is to set the value through the element while iterating.
> >>>>
> >>>>     while (it.hasNext()) {
> >>>>       Vector.Element e = it.next();
> >>>>       e.set(e.get() - average);
> >>>>     }
> >>>>
> >>>> I am sure we are incorrectly using the first style in many places across
> >>>> the code.
> >>>>
> >>>> I am proposing this
> >>>>
> >>>> When iterating, we lock the set interface on the vector using a State
> >>>> enum. If anyone tries to mutate, we throw an exception.
> >>>> We flip the state when we complete iterating (hasNext = false) or when
> >>>> we explicitly close the iterator (adding a close method on the iterator).
> >>>>
> >>>> Again, this is all a single-thread fix. If a vector is being mutated and
> >>>> iterated across multiple threads, all hell can break loose.
> >>>>
> >>>> Robin
> >>>>
> >>>>
> >>>>
> >>>> On Mon, Apr 15, 2013 at 12:56 AM, Robin Anil <robin.anil@gmail.com> wrote:
> >>>>
> >>>>> Spoke too soon, still failing.  I am uploading the latest patch. These
> >>>>> are the current failing tests.
> >>>>>
> >>>>>
> >>>>> ClusterClassificationDriverTest.testVectorClassificationWithOutlierRemovalMR:103->assertVectorsWithOutlierRemoval:189->checkClustersWithOutlierRemoval:239->Assert.assertTrue:41->Assert.fail:88
> >>>>> not expecting cluster:{0:1.0,1:1.0}
> >>>>>
> >>>>>
> >>>>> ClusterClassificationDriverTest.testVectorClassificationWithOutlierRemoval:139->assertVectorsWithOutlierRemoval:189->checkClustersWithOutlierRemoval:239->Assert.assertTrue:41->Assert.fail:88
> >>>>> not expecting cluster:{0:1.0,1:1.0}
> >>>>>
> >>>>>
> >>>>> ClusterClassificationDriverTest.testVectorClassificationWithoutOutlierRemoval:121->assertVectorsWithoutOutlierRemoval:193->assertFirstClusterWithoutOutlierRemoval:218->Assert.assertTrue:52->Assert.assertTrue:41->Assert.fail:86
> >>>>> null
> >>>>>
> >>>>>
> >>>>> ClusterOutputPostProcessorTest.testTopDownClustering:102->assertPostProcessedOutput:188->assertTopLevelCluster:115->assertPointsInSecondTopLevelCluster:134->Assert.assertTrue:52->Assert.assertTrue:41->Assert.fail:86
> >>>>> null
> >>>>>
> >>>>>
> >>>>> VectorSimilarityMeasuresTest.testPearsonCorrelationSimilarity:109->Assert.assertEquals:592->Assert.assertEquals:494->Assert.failNotEquals:743->Assert.fail:88
> >>>>> expected:<0.5303300858899108> but was:<0.38729833462074176>
> >>>>>
> >>>>>
> >>>>> Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
> >>>>>
> >>>>>
> >>>>> On Mon, Apr 15, 2013 at 12:24 AM, Robin Anil <robin.anil@gmail.com
> >wrote:
> >>>>>
> >>>>>> Found it, fixed it. I am submitting soon.
> >>>>>>
> >>>>>> Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
> >>>>>>
> >>>>>>
> >>>>>> On Sun, Apr 14, 2013 at 11:56 PM, Ted Dunning <
> ted.dunning@gmail.com>wrote:
> >>>>>>
> >>>>>>> Robin,
> >>>>>>>
> >>>>>>> Can you make sure that the patches are somewhere that Dan can pick up
> >>>>>>> this work?  He is in GMT+2 and is probably about to appear on the scene.
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> On Sun, Apr 14, 2013 at 9:34 PM, Robin Anil <ro...@gmail.com>
> >>>>>>> wrote:
> >>>>>>>
> >>>>>>> > Strike that, there are still failures. Investigating. If I can't fix
> >>>>>>> > it in the next hour, I will submit them sometime tomorrow evening.
> >>>>>>> >
> >>>>>>> > Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
> >>>>>>> >
> >>>>>>> >
> >>>>>>> > On Sun, Apr 14, 2013 at 11:33 PM, Robin Anil <
> robin.anil@gmail.com>
> >>>>>>> wrote:
> >>>>>>> >
> >>>>>>> > > Tests pass. Submitting the patches.
> >>>>>>> > >
> >>>>>>> > > Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
> >>>>>>> > >
> >>>>>>> > >
> >>>>>>> > > On Sun, Apr 14, 2013 at 11:17 PM, Robin Anil <
> >>>>>>> robin.anil@gmail.com>
> >>>>>>> > wrote:
> >>>>>>> > >
> >>>>>>> > >> Added a few more tests. Throw NoSuchElementException like Java
> >>>>>>> > >> Collections when iterating past the end. Things look solid,
> >>>>>>> performance
> >>>>>>> > is
> >>>>>>> > >> 2x. All Math tests pass. I am now waiting for the entire test
> >>>>>>> suites to
> >>>>>>> > run
> >>>>>>> > >> before submitting.
> >>>>>>> > >>
> >>>>>>> > >> Robin Anil | Software Engineer | +1 312 869 2602 | Google
> Inc.
> >>>>>>> > >>
> >>>>>>> > >>
> >>>>>>> > >> On Sun, Apr 14, 2013 at 9:49 PM, Robin Anil <
> >>>>>>> robin.anil@gmail.com>
> >>>>>>> > wrote:
> >>>>>>> > >>
> >>>>>>> > >>> I am not sure what I did. But removing Guava Abstract
> iterator
> >>>>>>> actually
> >>>>>>> > >>> sped up the dot, cosine, euclidean by another 60%. Things are
> >>>>>>> now 2x
> >>>>>>> > faster
> >>>>>>> > >>> than trunk. While also correcting the behavior (I hope)
> >>>>>>> > >>>
> >>>>>>> > >>>
> >>>>>>> > >>>
> >>>>>>> >
> >>>>>>>
> https://docs.google.com/spreadsheet/ccc?key=0AhewTD_ZgznddGFQbWJCQTZXSnFULUYzdURfWDRJQlE#gid=1
> >>>>>>> > >>>
> >>>>>>> > >>> Robin Anil | Software Engineer | +1 312 869 2602 | Google
> Inc.
> >>>>>>> > >>>
> >>>>>>> > >>>
> >>>>>>> > >>> On Sun, Apr 14, 2013 at 8:56 PM, Robin Anil <
> >>>>>>> robin.anil@gmail.com
> >>>>>>> > >wrote:
> >>>>>>> > >>>
> >>>>>>> > >>>> Also note that this is code gen, I have to create
> >>>>>>> > Element$keyType$Value
> >>>>>>> > >>>> for each and every combination not just int double. and also
> >>>>>>> update
> >>>>>>> > all
> >>>>>>> > >>>> callers to user ElementIntDouble instead of Element. Is it
> >>>>>>> worth it ?
> >>>>>>> > >>>>
> >>>>>>> > >>>> Robin Anil | Software Engineer | +1 312 869 2602 | Google
> >>>>>>> Inc.
> >>>>>>> > >>>>
> >>>>>>> > >>>>
> >>>>>>> > >>>> On Sun, Apr 14, 2013 at 8:46 PM, Ted Dunning <
> >>>>>>> ted.dunning@gmail.com
> >>>>>>> > >wrote:
> >>>>>>> > >>>>
> >>>>>>> > >>>>> Collections (no longer colt collections) are now part of
> >>>>>>> mahout math.
> >>>>>>> > >>>>>  No
> >>>>>>> > >>>>> need to keep them separate.  The lower iterator can
> reference
> >>>>>>> > >>>>> Vector.Element
> >>>>>>> > >>>>>
> >>>>>>> > >>>>>
> >>>>>>> > >>>>> On Sun, Apr 14, 2013 at 6:24 PM, Robin Anil <
> >>>>>>> robin.anil@gmail.com>
> >>>>>>> > >>>>> wrote:
> >>>>>>> > >>>>>
> >>>>>>> > >>>>> > I would have loved to but Element is a sub interface in
> >>>>>>> Vector. If
> >>>>>>> > >>>>> we want
> >>>>>>> > >>>>> > to keep colt collections separate we have to keep this
> >>>>>>> separation.
> >>>>>>> > >>>>> >
> >>>>>>> > >>>>>
> >>>>>>> > >>>>
> >>>>>>> > >>>>
> >>>>>>> > >>>
> >>>>>>> > >>
> >>>>>>> > >
> >>>>>>> >
> >>>>>>>
> >>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> >
>



-- 

  -jake

Re: Odd vector iteration behavior

Posted by Robin Anil <ro...@gmail.com>.
I am adding the tests and updating the patch.

Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.


On Mon, Apr 15, 2013 at 1:03 PM, Robin Anil <ro...@gmail.com> wrote:

> You can re-iterate if the state is in iteration. But you cannot write.
>
> This is what is happening:
>
> One of the values becomes 0, so Vector tries to remove it from the
> underlying hash map, which changes the layout. If a vector has to be
> mutated while iterating, we have to store a 0 value in the hash map rather
> than remove the entry, as the Vector layer does now. This adds another
> complication: the vector iterator has to skip over elements with a 0 value.
>
>
> Try this
>
> Create a vector of length 13 and set the following values.
>
>
>     double[] val = new double[] { 0, 2, 0, 0, 8, 3, 0, 6, 0, 1, 1, 2, 1 };
>     for (int i = 0; i < val.length; ++i) {
>       vector.set(i, val[i]);
>     }
>
> Then iterate over it, and while iterating set one of the values to zero.
>
> On Mon, Apr 15, 2013 at 12:56 PM, Dan Filimon <dangeorge.filimon@gmail.com
> > wrote:
>
>> What kind of Vector is failing to set() in that code?
>>
>> About the state enum, what if (for whatever reason, not
>> multi-threaded-ness) there are multiple iterators to that vector?
>> Something like a reference count (how many iterators point to it) would
>> probably be needed, and keeping it sane would only be possible in one
>> thread. Although this seems kind of brittle.
>>
>> +1 for numNonDefault.
>>
>>
>> On Mon, Apr 15, 2013 at 8:36 PM, Robin Anil <ro...@gmail.com> wrote:
>>
>>> Another behavior difference.
>>>
>>> The numNonDefaultElement for a DenseVector returns the total length.
>>> This causes Pearson Correlation Similarity to differ from if it was
>>> implemented using on of the SparseVector.
>>> I am proposing to fix the numNonDefaultElement to correctly iterate over
>>> the dense vector to figure out non zero values ? Sounds ok
>>>
>>> Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
>>>
>>>
>>> On Mon, Apr 15, 2013 at 12:32 PM, Robin Anil <ro...@gmail.com>wrote:
>>>
>>>> Found the bug PearsonCorrelationSimilarity was trying to mutate the
>>>> object while iterating.
>>>>
>>>>
>>>>     while (it.hasNext()) {
>>>>       Vector.Element e = it.next();
>>>>       vector.set(e.index(), e.get() - average);
>>>>     }
>>>>
>>>> This has a side effect of causing the underlying hash-map or object to
>>>> change.
>>>>
>>>> The right behavior is to set the value of the index while iterating.
>>>>
>>>>     while (it.hasNext()) {
>>>>       Vector.Element e = it.next();
>>>>       e.set(e.get() - average);
>>>>     }
>>>>
>>>> I am sure we are incorrectly doing the first style across the code at
>>>> many places.
>>>>
>>>> I am proposing this
>>>>
>>>> When iterating, we lock the set interface on the vector using a State
>>>> enum. If anyone tries to mutate, we throw an exception.
>>>> We flip the state when we complete iterating (hasNext = false) or when
>>>> we explicitly close the iterator (adding a close method on the iterator).
>>>>
>>>> Again this is all a single thread fix. if a vector is being mutated and
>>>> iterated across multiple threads, all hell can break loose.
>>>>
>>>> Robin
>>>>
>>>>
>>>>
>>>> On Mon, Apr 15, 2013 at 12:56 AM, Robin Anil <ro...@gmail.com>wrote:
>>>>
>>>>> Spoke too soon still failure.  I am uploading the latest patch. These
>>>>> are the current failing tests.
>>>>>
>>>>>  ClusterClassificationDriverTest.testVectorClassificationWithOutlierRemovalMR:103->assertVectorsWithOutlierRemoval:189->checkClustersWithOutlierRemoval:239->Assert.assertTrue:41->Assert.fail:88
>>>>> not expecting cluster:{0:1.0,1:1.0}
>>>>>
>>>>> ClusterClassificationDriverTest.testVectorClassificationWithOutlierRemoval:139->assertVectorsWithOutlierRemoval:189->checkClustersWithOutlierRemoval:239->Assert.assertTrue:41->Assert.fail:88
>>>>> not expecting cluster:{0:1.0,1:1.0}
>>>>>
>>>>> ClusterClassificationDriverTest.testVectorClassificationWithoutOutlierRemoval:121->assertVectorsWithoutOutlierRemoval:193->assertFirstClusterWithoutOutlierRemoval:218->Assert.assertTrue:52->Assert.assertTrue:41->Assert.fail:86
>>>>> null
>>>>>
>>>>> ClusterOutputPostProcessorTest.testTopDownClustering:102->assertPostProcessedOutput:188->assertTopLevelCluster:115->assertPointsInSecondTopLevelCluster:134->Assert.assertTrue:52->Assert.assertTrue:41->Assert.fail:86
>>>>> null
>>>>>
>>>>> VectorSimilarityMeasuresTest.testPearsonCorrelationSimilarity:109->Assert.assertEquals:592->Assert.assertEquals:494->Assert.failNotEquals:743->Assert.fail:88
>>>>> expected:<0.5303300858899108> but was:<0.38729833462074176>
>>>>>
>>>>>
>>>>> Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
>>>>>
>>>>>
>>>>> On Mon, Apr 15, 2013 at 12:24 AM, Robin Anil <ro...@gmail.com>wrote:
>>>>>
>>>>>> Found it, fixed it. I am submitting soon.
>>>>>>
>>>>>> Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
>>>>>>
>>>>>>
>>>>>> On Sun, Apr 14, 2013 at 11:56 PM, Ted Dunning <te...@gmail.com>wrote:
>>>>>>
>>>>>>> Robin,
>>>>>>>
>>>>>>> Can you make sure that the patches are somewhere that Dan can pick
>>>>>>> up this
>>>>>>> work?  He is in GMT+2 and is probably about to appear on the scene.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Sun, Apr 14, 2013 at 9:34 PM, Robin Anil <ro...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>> > Strike that there are still failures. Investigating. if I cant fix
>>>>>>> it in
>>>>>>> > the next hour, I will submit them sometime in the evening tomorrow.
>>>>>>> >
>>>>>>> > Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
>>>>>>> >
>>>>>>> >
>>>>>>> > On Sun, Apr 14, 2013 at 11:33 PM, Robin Anil <ro...@gmail.com>
>>>>>>> wrote:
>>>>>>> >
>>>>>>> > > Tests pass. Submitting the patches.
>>>>>>> > >
>>>>>>> > > Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
>>>>>>> > >
>>>>>>> > >
>>>>>>> > > On Sun, Apr 14, 2013 at 11:17 PM, Robin Anil <
>>>>>>> robin.anil@gmail.com>
>>>>>>> > wrote:
>>>>>>> > >
>>>>>>> > >> Added a few more tests. Throw NoSuchElementException like Java
>>>>>>> > >> Collections when iterating past the end. Things look solid,
>>>>>>> performance
>>>>>>> > is
>>>>>>> > >> 2x. All Math tests pass. I am now waiting for the entire test
>>>>>>> suites to
>>>>>>> > run
>>>>>>> > >> before submitting.
>>>>>>> > >>
>>>>>>> > >> Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
>>>>>>> > >>
>>>>>>> > >>
>>>>>>> > >> On Sun, Apr 14, 2013 at 9:49 PM, Robin Anil <
>>>>>>> robin.anil@gmail.com>
>>>>>>> > wrote:
>>>>>>> > >>
>>>>>>> > >>> I am not sure what I did. But removing Guava Abstract iterator
>>>>>>> actually
>>>>>>> > >>> sped up the dot, cosine, euclidean by another 60%. Things are
>>>>>>> now 2x
>>>>>>> > faster
>>>>>>> > >>> than trunk. While also correcting the behavior (I hope)
>>>>>>> > >>>
>>>>>>> > >>>
>>>>>>> > >>>
>>>>>>> >
>>>>>>> https://docs.google.com/spreadsheet/ccc?key=0AhewTD_ZgznddGFQbWJCQTZXSnFULUYzdURfWDRJQlE#gid=1
>>>>>>> > >>>
>>>>>>> > >>> Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
>>>>>>> > >>>
>>>>>>> > >>>
>>>>>>> > >>> On Sun, Apr 14, 2013 at 8:56 PM, Robin Anil <
>>>>>>> robin.anil@gmail.com
>>>>>>> > >wrote:
>>>>>>> > >>>
>>>>>>> > >>>> Also note that this is code gen, I have to create
>>>>>>> > Element$keyType$Value
>>>>>>> > >>>> for each and every combination not just int double. and also
>>>>>>> update
>>>>>>> > all
>>>>>>> > >>>> callers to user ElementIntDouble instead of Element. Is it
>>>>>>> worth it ?
>>>>>>> > >>>>
>>>>>>> > >>>> Robin Anil | Software Engineer | +1 312 869 2602 | Google
>>>>>>> Inc.
>>>>>>> > >>>>
>>>>>>> > >>>>
>>>>>>> > >>>> On Sun, Apr 14, 2013 at 8:46 PM, Ted Dunning <
>>>>>>> ted.dunning@gmail.com
>>>>>>> > >wrote:
>>>>>>> > >>>>
>>>>>>> > >>>>> Collections (no longer colt collections) are now part of
>>>>>>> mahout math.
>>>>>>> > >>>>>  No
>>>>>>> > >>>>> need to keep them separate.  The lower iterator can reference
>>>>>>> > >>>>> Vector.Element
>>>>>>> > >>>>>
>>>>>>> > >>>>>
>>>>>>> > >>>>> On Sun, Apr 14, 2013 at 6:24 PM, Robin Anil <
>>>>>>> robin.anil@gmail.com>
>>>>>>> > >>>>> wrote:
>>>>>>> > >>>>>
>>>>>>> > >>>>> > I would have loved to but Element is a sub interface in
>>>>>>> Vector. If
>>>>>>> > >>>>> we want
>>>>>>> > >>>>> > to keep colt collections separate we have to keep this
>>>>>>> separation.
>>>>>>> > >>>>> >
>>>>>>> > >>>>>
>>>>>>> > >>>>
>>>>>>> > >>>>
>>>>>>> > >>>
>>>>>>> > >>
>>>>>>> > >
>>>>>>> >
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Odd vector iteration behavior

Posted by Robin Anil <ro...@gmail.com>.
You can re-iterate while the vector is in the iterating state, but you
cannot write.

This is what is happening:

One of the values becomes 0, so Vector tries to remove it from the
underlying hash map, which changes the layout. If a vector has to be mutated
while iterating, we have to store a 0 value in the hash map rather than
remove the entry, as the Vector layer does now. This adds another
complication: the vector iterator has to skip over elements with a 0 value.


Try this

Create a vector of length 13 and set the following values.


    double[] val = new double[] { 0, 2, 0, 0, 8, 3, 0, 6, 0, 1, 1, 2, 1 };
    for (int i = 0; i < val.length; ++i) {
      vector.set(i, val[i]);
    }

Then iterate over it, and while iterating set one of the values to zero.
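
The idea of keeping a zeroed entry in place and letting the iterator skip it can be sketched as follows. This is only a self-contained illustration under stated assumptions: ZeroTolerantVector and nonZeroIndices() are hypothetical names, and a java.util.LinkedHashMap stands in for the hash map behind Mahout's sparse vectors.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.NoSuchElementException;

/** Toy stand-in for a hash-backed sparse vector (hypothetical, not Mahout code). */
final class ZeroTolerantVector {
  private final Map<Integer, Double> values = new LinkedHashMap<>();

  /** Overwrites an existing entry in place, even with zero, so the layout is unchanged. */
  void set(int index, double value) {
    if (value != 0.0 || values.containsKey(index)) {
      values.put(index, value);
    }
  }

  /** Returns the stored value, or zero if the index was never set. */
  double get(int index) {
    Double v = values.get(index);
    return v == null ? 0.0 : v;
  }

  /** Iterates over the indices of non-zero entries, skipping entries zeroed in place. */
  Iterator<Integer> nonZeroIndices() {
    final Iterator<Map.Entry<Integer, Double>> entries = values.entrySet().iterator();
    return new Iterator<Integer>() {
      private Integer pending; // next non-zero index, located lazily by hasNext()

      @Override
      public boolean hasNext() {
        while (pending == null && entries.hasNext()) {
          Map.Entry<Integer, Double> e = entries.next();
          if (e.getValue() != 0.0) {
            pending = e.getKey();
          }
        }
        return pending != null;
      }

      @Override
      public Integer next() {
        if (!hasNext()) {
          throw new NoSuchElementException();
        }
        Integer result = pending;
        pending = null;
        return result;
      }
    };
  }
}
```

Because the search for the next non-zero entry happens in hasNext(), an entry zeroed during the loop body is skipped on the following hasNext() call. Adding a new index during iteration would still change the layout and is not handled by this sketch.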

On Mon, Apr 15, 2013 at 12:56 PM, Dan Filimon
<da...@gmail.com>wrote:

> What kind of Vector is failing to set() in that code?
>
> About the state enum, what if (for whatever reason, not
> multi-threaded-ness) there are multiple iterators to that vector?
> Something like a reference count (how many iterators point to it) would
> probably be needed, and keeping it sane would only be possible in one
> thread. Although this seems kind of brittle.
>
> +1 for numNonDefault.
>
>

Re: Odd vector iteration behavior

Posted by Dan Filimon <da...@gmail.com>.
What kind of Vector is failing to set() in that code?

About the state enum, what if (for whatever reason, not
multi-threaded-ness) there are multiple iterators to that vector?
Something like a reference count (how many iterators point to it) would
probably be needed, and keeping it sane would only be possible in one
thread. Although this seems kind of brittle.

+1 for numNonDefault.
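
The reference-count idea can be sketched like this. It is a minimal single-threaded illustration, not a proposed patch: GuardedVector, Cursor and the openIterators counter are hypothetical names, and a plain double[] stands in for the vector storage.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

/** Toy vector that rejects set() while any iterator is open (single-threaded sketch). */
final class GuardedVector {
  private final double[] values;
  private int openIterators; // how many cursors currently point at this vector

  GuardedVector(int size) {
    values = new double[size];
  }

  void set(int index, double value) {
    if (openIterators > 0) {
      throw new IllegalStateException("vector is being iterated");
    }
    values[index] = value;
  }

  double get(int index) {
    return values[index];
  }

  Cursor iterator() {
    return new Cursor();
  }

  /** Iterator that releases its hold on the vector when exhausted or closed. */
  final class Cursor implements Iterator<Double>, AutoCloseable {
    private int position;
    private boolean released;

    Cursor() {
      openIterators++;
    }

    @Override
    public boolean hasNext() {
      if (position < values.length) {
        return true;
      }
      close(); // iteration finished, so drop the reference
      return false;
    }

    @Override
    public Double next() {
      if (!hasNext()) {
        throw new NoSuchElementException();
      }
      return values[position++];
    }

    @Override
    public void close() {
      if (!released) {
        released = true;
        openIterators--;
      }
    }
  }
}
```

The brittleness shows up directly: a cursor that is abandoned before it is exhausted or closed leaves the count above zero, and the vector then stays locked against writes.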


On Mon, Apr 15, 2013 at 8:36 PM, Robin Anil <ro...@gmail.com> wrote:

> Another behavior difference.
>
> The numNonDefaultElement for a DenseVector returns the total length. This
> causes the Pearson correlation similarity to differ from what it would be
> if it were computed using one of the SparseVectors.
> I propose fixing numNonDefaultElement to iterate over the dense vector and
> count the non-zero values. Does that sound OK?
>
> Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
>
>

Re: Odd vector iteration behavior

Posted by Robin Anil <ro...@gmail.com>.
Another behavior difference.

The numNonDefaultElement for a DenseVector returns the total length. This
causes the Pearson correlation similarity to differ from what it would be
if it were computed using one of the SparseVectors.
I propose fixing numNonDefaultElement to iterate over the dense vector and
count the non-zero values. Does that sound OK?
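
A minimal sketch of the proposed counting, assuming the dense storage is a plain double[]. DenseCounts and countNonZero are hypothetical names used only for illustration, not the Mahout method itself.

```java
/** Hypothetical helper showing the counting logic on a plain array. */
final class DenseCounts {
  private DenseCounts() {
  }

  /** Counts entries that differ from the default value 0.0. */
  static int countNonZero(double[] values) {
    int count = 0;
    for (double v : values) {
      if (v != 0.0) {
        count++;
      }
    }
    return count;
  }
}
```

With this, a dense and a sparse vector holding the same values would report the same count, at the cost of a linear scan on the dense side.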

Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.


On Mon, Apr 15, 2013 at 12:32 PM, Robin Anil <ro...@gmail.com> wrote:

> Found the bug PearsonCorrelationSimilarity was trying to mutate the
> object while iterating.
>
>
>     while (it.hasNext()) {
>       Vector.Element e = it.next();
>       vector.set(e.index(), e.get() - average);
>     }
>
> This has a side effect of causing the underlying hash-map or object to
> change.
>
> The right behavior is to set the value of the index while iterating.
>
>     while (it.hasNext()) {
>       Vector.Element e = it.next();
>       e.set(e.get() - average);
>     }
>
> I am sure we are incorrectly doing the first style across the code at many
> places.
>
> I am proposing this
>
> When iterating, we lock the set interface on the vector using a State
> enum. If anyone tries to mutate, we throw an exception.
> We flip the state when we complete iterating (hasNext = false) or when we
> explicitly close the iterator (adding a close method on the iterator).
>
> Again this is all a single thread fix. if a vector is being mutated and
> iterated across multiple threads, all hell can break loose.
>
> Robin
>
>
>

Re: Odd vector iteration behavior

Posted by Jake Mannix <ja...@gmail.com>.
It should be pretty easy to check, via a new unit test, whether interleaving
iteration with changing values works. It's hard to tell by inspection whether
indexOfInsertion() is implemented completely safely.
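
A rough sketch of what such a check could look like. Everything here is an
assumption made for illustration: ToyMap is a hypothetical linear-probing
stand-in, not Mahout's OpenIntDoubleHashMap, and the "return -slot-1 when the
key is already present" convention for indexOfInsertion() is assumed rather
than taken from the real code. The point is only the shape of the test: call
put() on keys that already exist while scanning the slots, then confirm every
entry was visited exactly once and no entry was added.

```java
import java.util.Arrays;

public class InterleavedSetCheck {
  /** Toy linear-probing int->double map; a stand-in, not Mahout's class. */
  static final class ToyMap {
    static final int FREE = Integer.MIN_VALUE; // sentinel, so not usable as a key
    final int[] keys = new int[16];            // fixed capacity, no rehashing
    final double[] values = new double[16];
    int size = 0;

    ToyMap() {
      Arrays.fill(keys, FREE);
    }

    // Returns the free slot for a new key, or -slot-1 if the key is present.
    int indexOfInsertion(int key) {
      int slot = Math.floorMod(key, keys.length);
      while (keys[slot] != FREE) {
        if (keys[slot] == key) {
          return -slot - 1;
        }
        slot = (slot + 1) % keys.length;
      }
      return slot;
    }

    void put(int key, double value) {
      int slot = indexOfInsertion(key);
      if (slot < 0) {
        values[-slot - 1] = value; // existing key: overwrite in place
        return;
      }
      keys[slot] = key;            // new key: occupies a fresh slot
      values[slot] = value;
      size++;
    }

    double get(int key) {
      int slot = indexOfInsertion(key);
      return slot < 0 ? values[-slot - 1] : 0.0;
    }
  }

  /** Subtracts delta from every entry by calling put() while scanning slots. */
  static int subtractWhileIterating(ToyMap map, double delta) {
    int visited = 0;
    for (int slot = 0; slot < map.keys.length; slot++) {
      if (map.keys[slot] != ToyMap.FREE) {
        map.put(map.keys[slot], map.values[slot] - delta);
        visited++;
      }
    }
    return visited;
  }

  public static void main(String[] args) {
    ToyMap map = new ToyMap();
    map.put(0, 1);
    map.put(2, 2);
    map.put(4, 3);
    map.put(6, 4);
    int visited = subtractWhileIterating(map, 2.5);
    System.out.println(visited + " entries visited, size " + map.size);
  }
}
```

A real test would run the same loop against the actual vector and map classes
and would also cover keys that collide, since that is where a probing bug is
most likely to show up.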


On Mon, Apr 15, 2013 at 10:50 AM, Robin Anil <ro...@gmail.com> wrote:

> On second thought both should work. The first method should not mutate if
> the element already exists. Now I am scared, this sounds to me like a bug
> in the OpenIntDoubleHashMap implementation.
>
> Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.



-- 

  -jake

Re: Odd vector iteration behavior

Posted by Robin Anil <ro...@gmail.com>.
On second thought, both should work. The first method should not change the
structure of the underlying map if the element already exists. Now I am
scared; this sounds to me like a bug in the OpenIntDoubleHashMap implementation.
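
For comparison, here is how java.util.HashMap treats the same pattern. This is
only an analogy and says nothing about OpenIntDoubleHashMap itself: in HashMap,
overwriting the value of a key that is already present is not a structural
modification, so iteration continues, while inserting a new key during
iteration makes the iterator throw ConcurrentModificationException. The class
and method names below are made up for the example.

```java
import java.util.ConcurrentModificationException;
import java.util.HashMap;
import java.util.Map;

public class ExistingKeyUpdate {
  /** Overwrites every existing key via map.put() while iterating; true if it completes. */
  static boolean overwriteExistingKeys(Map<Integer, Double> map, double average) {
    try {
      for (Map.Entry<Integer, Double> e : map.entrySet()) {
        // The key already exists, so only the stored value changes.
        map.put(e.getKey(), e.getValue() - average);
      }
      return true;
    } catch (ConcurrentModificationException ex) {
      return false;
    }
  }

  /** Inserts a key that is not yet present while iterating; true if the iterator rejects it. */
  static boolean insertNewKeyFails(Map<Integer, Double> map, int newKey) {
    try {
      for (Integer key : map.keySet()) {
        // A new key is a structural modification of the map.
        map.put(newKey, 0.0);
      }
      return false;
    } catch (ConcurrentModificationException ex) {
      return true;
    }
  }

  public static void main(String[] args) {
    Map<Integer, Double> map = new HashMap<>();
    map.put(0, 1.0);
    map.put(2, 2.0);
    map.put(4, 3.0);
    map.put(6, 4.0);
    System.out.println("overwrite completed: " + overwriteExistingKeys(map, 2.5));
    System.out.println("insert rejected: " + insertNewKeyFails(map, 8));
  }
}
```

If the Mahout map is meant to behave the same way for existing keys, a
set-on-existing-index during iteration should be harmless, which is why a
failure there points at the map rather than at the caller.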

Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.


On Mon, Apr 15, 2013 at 12:32 PM, Robin Anil <ro...@gmail.com> wrote:

> Found the bug PearsonCorrelationSimilarity was trying to mutate the
> object while iterating.
>
>
>    1.     while (it.hasNext()) {
>    2.       Vector.Element e = it.next();
>    3.       *vector.set(e.index(),* e.get() - average);
>    4.     }
>
> This has a side effect of causing the underlying hash-map or object to
> change.
>
> The right behavior is to set the value of the index while iterating.
>
>    1.     while (it.hasNext()) {
>    2.       Vector.Element e = it.next();
>    3.       *e.set(e.get()* - average);
>    4.     }
>
> I am sure we are incorrectly doing the first style across the code at many
> places.
>
> I am proposing this
>
> When iterating, we lock the set interface on the vector using a State
> enum. If anyone tries to mutate, we throw an exception.
> We flip the state when we complete iterating (hasNext = false) or when we
> explicitly close the iterator (adding a close method on the iterator).
>
> Again this is all a single thread fix. if a vector is being mutated and
> iterated across multiple threads, all hell can break loose.
>
> Robin

Re: Odd vector iteration behavior

Posted by Robin Anil <ro...@gmail.com>.
Found the bug: PearsonCorrelationSimilarity was trying to mutate the vector
while iterating over it.


    while (it.hasNext()) {
      Vector.Element e = it.next();
      vector.set(e.index(), e.get() - average);
    }

This has the side effect of changing the underlying hash map (or other
backing object) during iteration.

The right approach is to set the value through the element while iterating:

    while (it.hasNext()) {
      Vector.Element e = it.next();
      e.set(e.get() - average);
    }

I am sure we incorrectly use the first style in many places across the
code.

I am proposing this:

When iterating, we lock the set interface on the vector using a State enum.
If anyone tries to mutate the vector through it, we throw an exception.
We flip the state back when we complete iterating (hasNext() returns false) or
when we explicitly close the iterator (by adding a close method to the iterator).

Again, this is only a single-threaded fix. If a vector is being mutated and
iterated across multiple threads, all hell can break loose.
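
A minimal sketch of that proposal, under loud assumptions: GuardedVector,
GuardedIterator and the State values are hypothetical names, the vector is a
plain dense array rather than a Mahout Vector, and only one iterator is
assumed to be open at a time. It is meant to show the locking idea, not the
actual patch.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

public class GuardedVector {
  enum State { MUTABLE, ITERATING }

  private final double[] values;
  private State state = State.MUTABLE;

  GuardedVector(int size) {
    values = new double[size];
  }

  double get(int index) {
    return values[index];
  }

  /** Direct writes are rejected while an iterator is open. */
  void set(int index, double value) {
    if (state == State.ITERATING) {
      throw new IllegalStateException("set() called while an iterator is open");
    }
    values[index] = value;
  }

  /** View of one slot; its set() is the permitted way to write during iteration. */
  final class Element {
    private final int index;

    Element(int index) {
      this.index = index;
    }

    int index() {
      return index;
    }

    double get() {
      return values[index];
    }

    void set(double value) {
      values[index] = value;
    }
  }

  final class GuardedIterator implements Iterator<Element>, AutoCloseable {
    private int next = 0;

    GuardedIterator() {
      state = State.ITERATING; // lock the vector's set()
    }

    @Override
    public boolean hasNext() {
      if (next < values.length) {
        return true;
      }
      close(); // iteration complete: unlock
      return false;
    }

    @Override
    public Element next() {
      if (!hasNext()) {
        throw new NoSuchElementException();
      }
      return new Element(next++);
    }

    @Override
    public void close() {
      state = State.MUTABLE;
    }
  }

  GuardedIterator iterator() {
    return new GuardedIterator();
  }

  public static void main(String[] args) {
    GuardedVector v = new GuardedVector(3);
    v.set(0, 1.0);
    v.set(1, 2.0);
    v.set(2, 3.0);
    GuardedIterator it = v.iterator();
    while (it.hasNext()) {
      Element e = it.next();
      e.set(e.get() - 2.0); // allowed: goes through the element
    }
    v.set(0, 10.0); // allowed again: the iterator finished and unlocked
    System.out.println(v.get(0) + " " + v.get(1) + " " + v.get(2));
  }
}
```

A single state flag cannot track several open iterators, and a caller that
abandons an iterator without finishing or closing it leaves the vector locked;
a real implementation would have to decide how to handle both cases.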

Robin



On Mon, Apr 15, 2013 at 12:56 AM, Robin Anil <ro...@gmail.com> wrote:

> Spoke too soon still failure.  I am uploading the latest patch. These are
> the current failing tests.
>
>  ClusterClassificationDriverTest.testVectorClassificationWithOutlierRemovalMR:103->assertVectorsWithOutlierRemoval:189->checkClustersWithOutlierRemoval:239->Assert.assertTrue:41->Assert.fail:88
> not expecting cluster:{0:1.0,1:1.0}
>
> ClusterClassificationDriverTest.testVectorClassificationWithOutlierRemoval:139->assertVectorsWithOutlierRemoval:189->checkClustersWithOutlierRemoval:239->Assert.assertTrue:41->Assert.fail:88
> not expecting cluster:{0:1.0,1:1.0}
>
> ClusterClassificationDriverTest.testVectorClassificationWithoutOutlierRemoval:121->assertVectorsWithoutOutlierRemoval:193->assertFirstClusterWithoutOutlierRemoval:218->Assert.assertTrue:52->Assert.assertTrue:41->Assert.fail:86
> null
>
> ClusterOutputPostProcessorTest.testTopDownClustering:102->assertPostProcessedOutput:188->assertTopLevelCluster:115->assertPointsInSecondTopLevelCluster:134->Assert.assertTrue:52->Assert.assertTrue:41->Assert.fail:86
> null
>
> VectorSimilarityMeasuresTest.testPearsonCorrelationSimilarity:109->Assert.assertEquals:592->Assert.assertEquals:494->Assert.failNotEquals:743->Assert.fail:88
> expected:<0.5303300858899108> but was:<0.38729833462074176>
>
>
> Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.

Re: Odd vector iteration behavior

Posted by Robin Anil <ro...@gmail.com>.
Spoke too soon; there are still failures. I am uploading the latest patch.
These are the currently failing tests:

 ClusterClassificationDriverTest.testVectorClassificationWithOutlierRemovalMR:103->assertVectorsWithOutlierRemoval:189->checkClustersWithOutlierRemoval:239->Assert.assertTrue:41->Assert.fail:88
not expecting cluster:{0:1.0,1:1.0}

ClusterClassificationDriverTest.testVectorClassificationWithOutlierRemoval:139->assertVectorsWithOutlierRemoval:189->checkClustersWithOutlierRemoval:239->Assert.assertTrue:41->Assert.fail:88
not expecting cluster:{0:1.0,1:1.0}

ClusterClassificationDriverTest.testVectorClassificationWithoutOutlierRemoval:121->assertVectorsWithoutOutlierRemoval:193->assertFirstClusterWithoutOutlierRemoval:218->Assert.assertTrue:52->Assert.assertTrue:41->Assert.fail:86
null

ClusterOutputPostProcessorTest.testTopDownClustering:102->assertPostProcessedOutput:188->assertTopLevelCluster:115->assertPointsInSecondTopLevelCluster:134->Assert.assertTrue:52->Assert.assertTrue:41->Assert.fail:86
null

VectorSimilarityMeasuresTest.testPearsonCorrelationSimilarity:109->Assert.assertEquals:592->Assert.assertEquals:494->Assert.failNotEquals:743->Assert.fail:88
expected:<0.5303300858899108> but was:<0.38729833462074176>


Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.


On Mon, Apr 15, 2013 at 12:24 AM, Robin Anil <ro...@gmail.com> wrote:

> Found it, fixed it. I am submitting soon.
>
> Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.

Re: Odd vector iteration behavior

Posted by Robin Anil <ro...@gmail.com>.
Found it, fixed it. I am submitting soon.

Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.


On Sun, Apr 14, 2013 at 11:56 PM, Ted Dunning <te...@gmail.com> wrote:

> Robin,
>
> Can you make sure that the patches are somewhere that Dan can pick up this
> work?  He is in GMT+2 and is probably about to appear on the scene.

Re: Odd vector iteration behavior

Posted by Ted Dunning <te...@gmail.com>.
Robin,

Can you make sure that the patches are somewhere that Dan can pick up this
work?  He is in GMT+2 and is probably about to appear on the scene.



On Sun, Apr 14, 2013 at 9:34 PM, Robin Anil <ro...@gmail.com> wrote:

> Strike that there are still failures. Investigating. if I cant fix it in
> the next hour, I will submit them sometime in the evening tomorrow.
>
> Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.

Re: Odd vector iteration behavior

Posted by Robin Anil <ro...@gmail.com>.
Strike that, there are still failures. Investigating. If I can't fix it in
the next hour, I will submit the patches sometime tomorrow evening.

Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.


On Sun, Apr 14, 2013 at 11:33 PM, Robin Anil <ro...@gmail.com> wrote:

> Tests pass. Submitting the patches.
>
> Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.

Re: Odd vector iteration behavior

Posted by Robin Anil <ro...@gmail.com>.
Tests pass. Submitting the patches.

Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.


On Sun, Apr 14, 2013 at 11:17 PM, Robin Anil <ro...@gmail.com> wrote:

> Added a few more tests. Throw NoSuchElementException like Java Collections
> when iterating past the end. Things look solid, performance is 2x. All Math
> tests pass. I am now waiting for the entire test suites to run before
> submitting.
>
> Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
>
>
> On Sun, Apr 14, 2013 at 9:49 PM, Robin Anil <ro...@gmail.com> wrote:
>
>> I am not sure what I did. But removing Guava Abstract iterator actually
>> sped up the dot, cosine, euclidean by another 60%. Things are now 2x faster
>> than trunk. While also correcting the behavior (I hope)
>>
>>
>> https://docs.google.com/spreadsheet/ccc?key=0AhewTD_ZgznddGFQbWJCQTZXSnFULUYzdURfWDRJQlE#gid=1
>>
>> Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
>>
>>
>> On Sun, Apr 14, 2013 at 8:56 PM, Robin Anil <ro...@gmail.com> wrote:
>>
>>> Also note that this is code gen, I have to create Element$keyType$Value
>>> for each and every combination not just int double. and also update all
>>> callers to user ElementIntDouble instead of Element. Is it worth it ?
>>>
>>> Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
>>>
>>>
>>> On Sun, Apr 14, 2013 at 8:46 PM, Ted Dunning <te...@gmail.com>wrote:
>>>
>>>> Collections (no longer colt collections) are now part of mahout math.
>>>>  No
>>>> need to keep them separate.  The lower iterator can reference
>>>> Vector.Element
>>>>
>>>>
>>>> On Sun, Apr 14, 2013 at 6:24 PM, Robin Anil <ro...@gmail.com>
>>>> wrote:
>>>>
>>>> > I would have loved to but Element is a sub interface in Vector. If we
>>>> want
>>>> > to keep colt collections separate we have to keep this separation.
>>>> >
>>>>
>>>
>>>
>>
>

Re: Odd vector iteration behavior

Posted by Robin Anil <ro...@gmail.com>.
Added a few more tests. The iterators now throw NoSuchElementException, like
Java Collections, when iterating past the end. Things look solid and
performance is 2x. All Math tests pass. I am now waiting for the entire test
suite to run before submitting.
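
To make the intended contract concrete, here is a minimal sketch (not the
actual patch). ReusedElementIterator, its nested Element, and trace() are
hypothetical names, and plain arrays stand in for the vector storage. The
point is that hasNext() is a pure query, next() is the only call that
re-points the reused element, and next() throws NoSuchElementException once
the data runs out.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

/** Hypothetical stand-in for a sparse vector's non-zero iterator (not Mahout code). */
final class ReusedElementIterator implements Iterator<ReusedElementIterator.Element> {

  /** Single mutable view; next() re-points it instead of allocating a new object. */
  final class Element {
    private int offset = -1;

    int index() {
      return indices[offset];
    }

    double get() {
      return values[offset];
    }
  }

  private final int[] indices;
  private final double[] values;
  private final Element element = new Element();
  private int nextOffset;

  ReusedElementIterator(int[] indices, double[] values) {
    this.indices = indices;
    this.values = values;
  }

  @Override
  public boolean hasNext() {
    // Pure query: nothing is mutated, so calling it repeatedly is harmless.
    return nextOffset < indices.length;
  }

  @Override
  public Element next() {
    if (!hasNext()) {
      throw new NoSuchElementException();
    }
    element.offset = nextOffset++;
    return element;
  }

  @Override
  public void remove() {
    throw new UnsupportedOperationException();
  }

  /** Replays a loop that only advances on even steps and records "step:index" pairs. */
  static String trace(int[] indices, double[] values) {
    Iterator<Element> it = new ReusedElementIterator(indices, values);
    StringBuilder out = new StringBuilder();
    Element current = null;
    int i = 0;
    while (it.hasNext()) {
      if (i % 2 == 0) {
        current = it.next();
      }
      out.append(i).append(':').append(current.index()).append(' ');
      ++i;
    }
    return out.toString().trim();
  }

  public static void main(String[] args) {
    System.out.println(trace(new int[] {0, 2, 4, 6}, new double[] {1, 2, 3, 4}));
  }
}
```

With this shape, the element a caller holds only changes when the caller
itself calls next(), so skipping next() on a step leaves the previous index
visible.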

Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.


On Sun, Apr 14, 2013 at 9:49 PM, Robin Anil <ro...@gmail.com> wrote:

> I am not sure what I did. But removing Guava Abstract iterator actually
> sped up the dot, cosine, euclidean by another 60%. Things are now 2x faster
> than trunk. While also correcting the behavior (I hope)
>
>
> https://docs.google.com/spreadsheet/ccc?key=0AhewTD_ZgznddGFQbWJCQTZXSnFULUYzdURfWDRJQlE#gid=1
>
> Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
>
>
> On Sun, Apr 14, 2013 at 8:56 PM, Robin Anil <ro...@gmail.com> wrote:
>
>> Also note that this is code gen, I have to create Element$keyType$Value
>> for each and every combination not just int double. and also update all
>> callers to user ElementIntDouble instead of Element. Is it worth it ?
>>
>> Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
>>
>>
>> On Sun, Apr 14, 2013 at 8:46 PM, Ted Dunning <te...@gmail.com>wrote:
>>
>>> Collections (no longer colt collections) are now part of mahout math.  No
>>> need to keep them separate.  The lower iterator can reference
>>> Vector.Element
>>>
>>>
>>> On Sun, Apr 14, 2013 at 6:24 PM, Robin Anil <ro...@gmail.com>
>>> wrote:
>>>
>>> > I would have loved to but Element is a sub interface in Vector. If we
>>> want
>>> > to keep colt collections separate we have to keep this separation.
>>> >
>>>
>>
>>
>

Re: Odd vector iteration behavior

Posted by Robin Anil <ro...@gmail.com>.
I am not sure what I did, but removing Guava's AbstractIterator actually
sped up dot, cosine, and euclidean by another 60%. Things are now 2x faster
than trunk, while also correcting the behavior (I hope).
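
For anyone following along, here is a rough sketch of why the look-ahead
style and element reuse do not mix. LookAheadIterator below is a simplified
stand-in written for this illustration, not Guava's actual source, and
ReusingLookAhead and trace() are hypothetical names. Because hasNext() has
to call computeNext() to find out whether more data exists, a computeNext()
that mutates a shared element changes the object the caller is still holding.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

/** Simplified look-ahead base class; a stand-in for the pattern, not Guava's source. */
abstract class LookAheadIterator<T> implements Iterator<T> {
  private T peeked;
  private boolean ready;
  private boolean done;

  /** Returns the next item, or null once the data is exhausted. */
  protected abstract T computeNext();

  @Override
  public final boolean hasNext() {
    if (done) {
      return false;
    }
    if (!ready) {
      peeked = computeNext(); // the look-ahead happens here, inside hasNext()
      if (peeked == null) {
        done = true;
        return false;
      }
      ready = true;
    }
    return true;
  }

  @Override
  public final T next() {
    if (!hasNext()) {
      throw new NoSuchElementException();
    }
    ready = false;
    return peeked;
  }
}

/** Hypothetical subclass that reuses one mutable element, as in the buggy style. */
final class ReusingLookAhead extends LookAheadIterator<ReusingLookAhead.Element> {

  final class Element {
    private int offset = -1;

    int index() {
      return indices[offset];
    }
  }

  private final int[] indices;
  private final Element element = new Element();

  ReusingLookAhead(int[] indices) {
    this.indices = indices;
  }

  @Override
  protected Element computeNext() {
    if (element.offset + 1 >= indices.length) {
      return null;
    }
    element.offset++; // mutates the object the caller may still be holding
    return element;
  }

  /** Advances only on even steps and records "step:index" pairs. */
  static String trace(int[] indices) {
    Iterator<Element> it = new ReusingLookAhead(indices);
    StringBuilder out = new StringBuilder();
    Element current = null;
    int i = 0;
    while (it.hasNext()) {
      if (i % 2 == 0) {
        current = it.next();
      }
      out.append(i).append(':').append(current.index()).append(' ');
      ++i;
    }
    return out.toString().trim();
  }

  public static void main(String[] args) {
    System.out.println(trace(new int[] {0, 2, 4, 6}));
  }
}
```

On odd steps the caller never calls next(), yet the index it sees has already
moved on, because the hasNext() check in the loop condition ran computeNext().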

https://docs.google.com/spreadsheet/ccc?key=0AhewTD_ZgznddGFQbWJCQTZXSnFULUYzdURfWDRJQlE#gid=1

Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.


On Sun, Apr 14, 2013 at 8:56 PM, Robin Anil <ro...@gmail.com> wrote:

> Also note that this is code gen, I have to create Element$keyType$Value
> for each and every combination not just int double. and also update all
> callers to user ElementIntDouble instead of Element. Is it worth it ?
>
> Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
>
>
> On Sun, Apr 14, 2013 at 8:46 PM, Ted Dunning <te...@gmail.com>wrote:
>
>> Collections (no longer colt collections) are now part of mahout math.  No
>> need to keep them separate.  The lower iterator can reference
>> Vector.Element
>>
>>
>> On Sun, Apr 14, 2013 at 6:24 PM, Robin Anil <ro...@gmail.com> wrote:
>>
>> > I would have loved to but Element is a sub interface in Vector. If we
>> want
>> > to keep colt collections separate we have to keep this separation.
>> >
>>
>
>

Re: Odd vector iteration behavior

Posted by Robin Anil <ro...@gmail.com>.
Also note that this is code gen: I have to create Element$keyType$Value for
each and every combination, not just int/double, and also update all callers
to use ElementIntDouble instead of Element. Is it worth it?

Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.


On Sun, Apr 14, 2013 at 8:46 PM, Ted Dunning <te...@gmail.com> wrote:

> Collections (no longer colt collections) are now part of mahout math.  No
> need to keep them separate.  The lower iterator can reference
> Vector.Element
>
>
> On Sun, Apr 14, 2013 at 6:24 PM, Robin Anil <ro...@gmail.com> wrote:
>
> > I would have loved to but Element is a sub interface in Vector. If we
> want
> > to keep colt collections separate we have to keep this separation.
> >
>

Re: Odd vector iteration behavior

Posted by Ted Dunning <te...@gmail.com>.
Collections (no longer colt collections) are now part of mahout math.  No
need to keep them separate.  The lower iterator can reference Vector.Element


On Sun, Apr 14, 2013 at 6:24 PM, Robin Anil <ro...@gmail.com> wrote:

> I would have loved to but Element is a sub interface in Vector. If we want
> to keep colt collections separate we have to keep this separation.
>

Re: Odd vector iteration behavior

Posted by Robin Anil <ro...@gmail.com>.
I would have loved to, but Element is a nested interface in Vector. If we
want to keep colt collections separate, we have to keep this separation.

Re: Odd vector iteration behavior

Posted by Ted Dunning <te...@gmail.com>.
Hmph..

You delegate to the lower iterator entirely.  Why not just return it in the
first place?


On Sun, Apr 14, 2013 at 5:27 PM, Robin Anil <ro...@gmail.com> wrote:

> I am working on a patch. Here is a sample. This one is for RASV. I put
> Dan's example as a test case.
>
>
>
>   private final class NonDefaultIterator implements Iterator<Element> {
>     private final class NonDefaultElement implements Element {
>       @Override
>       public double get() {
>         return mapElement.get();
>       }
>
>       @Override
>       public int index() {
>         return mapElement.index();
>       }
>
>       @Override
>       public void set(double value) {
>         invalidateCachedLength();
>         mapElement.set(value);
>       }
>     }
>
>     private final NonDefaultElement element = new NonDefaultElement();
>     private final Iterator<MapElement> iterator;
>     private MapElement mapElement;
>
>     private NonDefaultIterator() {
>       this.iterator = values.iterator();
>     }
>
>     @Override
>     public boolean hasNext() {
>       return iterator.hasNext();
>     }
>
>     @Override
>     public Element next() {
>       mapElement = iterator.next();
>       return element;
>     }
>
>     @Override
>     public void remove() {
>       throw new UnsupportedOperationException();
>     }
>   }
>
>
>
> Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
>
>
> On Sun, Apr 14, 2013 at 7:04 PM, Ted Dunning <te...@gmail.com>
> wrote:
>
> > Well... current iterator style with a non-side-effecting version of
> > hasNext(), of course.
> >
> > Reusing the container is OK if the performance hit is substantial.
> >
> >
> > On Sun, Apr 14, 2013 at 5:02 PM, Robin Anil <ro...@gmail.com>
> wrote:
> >
> > > Also the Tests crash due to excessive GC. The performance degradation
> > there
> > > is very visible(my cpu spikes up). I think there is good case for the
> > > current iteration style, just that we have to, not use the java
> Iterator
> > > contract and confuse clients.
> > >
> > > Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
> > >
> > >
> > > On Sun, Apr 14, 2013 at 6:59 PM, Robin Anil <ro...@gmail.com>
> > wrote:
> > >
> > > > Yes. All final.
> > > >
> > > > Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
> > > >
> > > >
> > > > On Sun, Apr 14, 2013 at 6:55 PM, Ted Dunning <ted.dunning@gmail.com
> > > >wrote:
> > > >
> > > >> Did you mark the class and fields all as final?
> > > >>
> > > >> That might help the compiler realize it could in-line stuff and
> avoid
> > > the
> > > >> constructor (not likely, but possible)
> > > >>
> > > >>
> > > >> On Sun, Apr 14, 2013 at 4:52 PM, Robin Anil <ro...@gmail.com>
> > > wrote:
> > > >>
> > > >> > With a new immutable Element in the iterator, the iteration
> behavior
> > > is
> > > >> > corrected but. There is a performance degradation of about 10% and
> > > >> > nullifies what I have done with the patch.
> > > >> >
> > > >> > See
> > > >> >
> > > >> >
> > > >>
> > >
> >
> https://docs.google.com/spreadsheet/ccc?key=0AhewTD_ZgznddGFQbWJCQTZXSnFULUYzdURfWDRJQlE#gid=1
> > > >> >
> > > >> > Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
> > > >> >
> > > >> >
> > > >> > On Sun, Apr 14, 2013 at 11:28 AM, Ted Dunning <
> > ted.dunning@gmail.com>
> > > >> > wrote:
> > > >> >
> > > >> > > Yeah... but we still have to fix the iterator.
> > > >> > >
> > > >> > >
> > > >> > > On Sun, Apr 14, 2013 at 8:58 AM, Robin Anil <
> robin.anil@gmail.com
> > >
> > > >> > wrote:
> > > >> > >
> > > >> > > > Here is an iteration style that works as is with today's
> > behaviour
> > > >> of
> > > >> > > > hasNext
> > > >> > > >
> > > >> > > >       Element thisElement = null;
> > > >> > > >       Element thatElement = null;
> > > >> > > >       boolean advanceThis = true;
> > > >> > > >       boolean advanceThat = true;
> > > >> > > >
> > > >> > > >       Iterator<Element> thisNonZero = this.iterateNonZero();
> > > >> > > >       Iterator<Element> thatNonZero = x.iterateNonZero();
> > > >> > > >
> > > >> > > >       double result = 0.0;
> > > >> > > >       while (true) {
> > > >> > > >         if (advanceThis) {
> > > >> > > >           if (!thisNonZero.hasNext()) {
> > > >> > > >             break;
> > > >> > > >           }
> > > >> > > >           thisElement = thisNonZero.next();
> > > >> > > >         }
> > > >> > > >         if (advanceThat) {
> > > >> > > >           if (!thatNonZero.hasNext()) {
> > > >> > > >             break;
> > > >> > > >           }
> > > >> > > >           thatElement = thatNonZero.next();
> > > >> > > >         }
> > > >> > > >         if (thisElement.index() == thatElement.index()) {
> > > >> > > >           result += thisElement.get() * thatElement.get();
> > > >> > > >           advanceThis = true;
> > > >> > > >           advanceThat = true;
> > > >> > > >         } else if (thisElement.index() < thatElement.index()) {
> > > >> > > >           advanceThis = true;
> > > >> > > >           advanceThat = false;
> > > >> > > >         } else {
> > > >> > > >           advanceThis = false;
> > > >> > > >           advanceThat = true;
> > > >> > > >         }
> > > >> > > >       }
> > > >> > > >
> > > >> > > >
> > > >> > > > On Sat, Apr 13, 2013 at 1:47 AM, Ted Dunning <
> > > ted.dunning@gmail.com
> > > >> >
> > > >> > > > wrote:
> > > >> > > >
> > > >> > > > > The caller is not at fault here.  The problem is that
> hasNext
> > is
> > > >> > > > advancing
> > > >> > > > > the iterator due to a side effect.  The side effect is
> > > impossible
> > > >> to
> > > >> > > > avoid
> > > >> > > > > at the level of the caller.
> > > >> > > > >
> > > >> > > > > Sent from my iPhone
> > > >> > > > >
> > > >> > > > > On Apr 12, 2013, at 12:22, Sean Owen <sr...@gmail.com>
> > wrote:
> > > >> > > > >
> > > >> > > > > > I'm sure I did (at least much of) the AbstractIterator
> > change
> > > so
> > > >> > > blame
> > > >> > > > > > me... but I think the pattern itself is just fine. It's
> used
> > > in
> > > >> > many
> > > >> > > > > > places in the project. Reusing the value object is a big
> win
> > > in
> > > >> > some
> > > >> > > > > > places. Allocating objects is fast but a trillion of them
> > > still
> > > >> > adds
> > > >> > > > > > up.
> > > >> > > > > >
> > > >> > > > > > It does contain a requirement, and that is that the caller
> > is
> > > >> > > supposed
> > > >> > > > > > to copy/clone the value if it will be used at all after
> the
> > > next
> > > >> > > > > > iterator operation. That's the 0th option, to just fix the
> > > >> caller
> > > >> > > > > > here.
> > > >> > > > > >
> > > >> > > > > > On Fri, Apr 12, 2013 at 7:49 PM, Ted Dunning <
> > > >> > ted.dunning@gmail.com>
> > > >> > > > > wrote:
> > > >> > > > > >> The contract of computeNext is that there are no side
> > effects
> > > >> > > visible
> > > >> > > > > >> outside (i.e. apparent functional style).  This is
> required
> > > >> since
> > > >> > > > > >> computeNext is called from hasNext().
> > > >> > > > > >>
> > > >> > > > > >> We are using a side-effecting style so we have a bug.
> > > >> > > > > >>
> > > >> > > > > >> We have two choices:
> > > >> > > > > >>
> > > >> > > > > >> a) use functional style. This will *require* that we
> > > allocate a
> > > >> > new
> > > >> > > > > >> container element on every call to computeNext.  This is
> > best
> > > >> for
> > > >> > > the
> > > >> > > > > user
> > > >> > > > > >> because they will have fewer surprising bugs due to
> reuse.
> > >  If
> > > >> > > > > allocation
> > > >> > > > > >> is actually as bad as some people think (I remain
> skeptical
> > > of
> > > >> > that
> > > >> > > > > without
> > > >> > > > > >> tests) then this is a bad move.  If allocation of totally
> > > >> > ephemeral
> > > >> > > > > objects
> > > >> > > > > >> is as cheap as I think, then this would be a good move.
> > > >> > > > > >>
> > > >> > > > > >> b) stop using AbstractIterator and continue with the
> re-use
> > > >> style.
> > > >> > > >  And
> > > >> > > > > add
> > > >> > > > > >> a comment to prevent a bright spark from reverting this
> > > change.
> > > >> >  (I
> > > >> > > > > suspect
> > > >> > > > > >> that the bright spark who did this in the first place was
> > me
> > > >> so I
> > > >> > > can
> > > >> > > > be
> > > >> > > > > >> rude)
> > > >> > > > >
> > > >> > > >
> > > >> > >
> > > >> >
> > > >>
> > > >
> > > >
> > >
> >
>

Re: Odd vector iteration behavior

Posted by Robin Anil <ro...@gmail.com>.
Ignore that, I found an issue with the Dense iterator. Those tests are
passing now except for one
(testTimesSquaredTimesVector(org.apache.mahout.math.PivotedMatrixTest)).
I have also updated the review request.

https://reviews.apache.org/r/10455/diff/#index_header


On Sun, Apr 14, 2013 at 8:13 PM, Robin Anil <ro...@gmail.com> wrote:

> After fixing iterator(assuming my patch is correct).
>
> The following tests are failing in math because of incorrect usage of
> next() without checking hasNext(). I would need some help in fixing them as
> I have never touched the code before
>
>   SequentialBigSvdTest.testSingularValues:40->assertEquals:64 » NullPointer
>   LogLikelihoodTest.testFrequencyComparison:108 » NullPointer
>   TestHebbianSolver.testHebbianSolver:86->timeSolver:59 » NullPointer
>   MultiNormalTest.testDiagonal:54 » NullPointer
>   PermutedVectorViewTest.testIterators:60 » NullPointer
>   WeightedVectorTest.testProjection:66 » NullPointer
>   WeightedVectorTest>AbstractVectorTest.testSimpleOps:52 » NullPointer
>
>
> See the iterator patch
> https://reviews.apache.org/r/10455/diff/#index_header
>
> Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
>
>
> On Sun, Apr 14, 2013 at 7:27 PM, Robin Anil <ro...@gmail.com> wrote:
>
>> I am working on a patch. Here is a sample. This one is for RASV. I put
>> Dan's example as a test case.
>>
>>
>>
>>   private final class NonDefaultIterator implements Iterator<Element> {
>>     private final class NonDefaultElement implements Element {
>>       @Override
>>       public double get() {
>>         return mapElement.get();
>>       }
>>
>>       @Override
>>       public int index() {
>>         return mapElement.index();
>>       }
>>
>>       @Override
>>       public void set(double value) {
>>         invalidateCachedLength();
>>         mapElement.set(value);
>>       }
>>     }
>>
>>     private final NonDefaultElement element = new NonDefaultElement();
>>     private final Iterator<MapElement> iterator;
>>     private MapElement mapElement;
>>
>>     private NonDefaultIterator() {
>>       this.iterator = values.iterator();
>>     }
>>
>>     @Override
>>     public boolean hasNext() {
>>       return iterator.hasNext();
>>     }
>>
>>     @Override
>>     public Element next() {
>>       mapElement = iterator.next();
>>       return element;
>>     }
>>
>>     @Override
>>     public void remove() {
>>       throw new UnsupportedOperationException();
>>     }
>>   }
>>
>>
>>
>> Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
>>
>>
>> On Sun, Apr 14, 2013 at 7:04 PM, Ted Dunning <te...@gmail.com>wrote:
>>
>>> Well... current iterator style with a non-side-effecting version of
>>> hasNext(), of course.
>>>
>>> Reusing the container is OK if the performance hit is substantial.
>>>
>>>
>>> On Sun, Apr 14, 2013 at 5:02 PM, Robin Anil <ro...@gmail.com>
>>> wrote:
>>>
>>> > Also the Tests crash due to excessive GC. The performance degradation
>>> there
>>> > is very visible(my cpu spikes up). I think there is good case for the
>>> > current iteration style, just that we have to, not use the java
>>> Iterator
>>> > contract and confuse clients.
>>> >
>>> > Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
>>> >
>>> >
>>> > On Sun, Apr 14, 2013 at 6:59 PM, Robin Anil <ro...@gmail.com>
>>> wrote:
>>> >
>>> > > Yes. All final.
>>> > >
>>> > > Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
>>> > >
>>> > >
>>> > > On Sun, Apr 14, 2013 at 6:55 PM, Ted Dunning <ted.dunning@gmail.com
>>> > >wrote:
>>> > >
>>> > >> Did you mark the class and fields all as final?
>>> > >>
>>> > >> That might help the compiler realize it could in-line stuff and
>>> avoid
>>> > the
>>> > >> constructor (not likely, but possible)
>>> > >>
>>> > >>
>>> > >> On Sun, Apr 14, 2013 at 4:52 PM, Robin Anil <ro...@gmail.com>
>>> > wrote:
>>> > >>
>>> > >> > With a new immutable Element in the iterator, the iteration
>>> behavior
>>> > is
>>> > >> > corrected but. There is a performance degradation of about 10% and
>>> > >> > nullifies what I have done with the patch.
>>> > >> >
>>> > >> > See
>>> > >> >
>>> > >> >
>>> > >>
>>> >
>>> https://docs.google.com/spreadsheet/ccc?key=0AhewTD_ZgznddGFQbWJCQTZXSnFULUYzdURfWDRJQlE#gid=1
>>> > >> >
>>> > >> > Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
>>> > >> >
>>> > >> >
>>> > >> > On Sun, Apr 14, 2013 at 11:28 AM, Ted Dunning <
>>> ted.dunning@gmail.com>
>>> > >> > wrote:
>>> > >> >
>>> > >> > > Yeah... but we still have to fix the iterator.
>>> > >> > >
>>> > >> > >
>>> > >> > > On Sun, Apr 14, 2013 at 8:58 AM, Robin Anil <
>>> robin.anil@gmail.com>
>>> > >> > wrote:
>>> > >> > >
>>> > >> > > > Here is an iteration style that works as is with today's
>>> behaviour
>>> > >> of
>>> > >> > > > hasNext
>>> > >> > > >
>>> > >> > > >       Element thisElement = null;
>>> > >> > > >       Element thatElement = null;
>>> > >> > > >       boolean advanceThis = true;
>>> > >> > > >       boolean advanceThat = true;
>>> > >> > > >
>>> > >> > > >       Iterator<Element> thisNonZero = this.iterateNonZero();
>>> > >> > > >       Iterator<Element> thatNonZero = x.iterateNonZero();
>>> > >> > > >
>>> > >> > > >       double result = 0.0;
>>> > >> > > >       while (true) {
>>> > >> > > >         if (advanceThis) {
>>> > >> > > >           if (!thisNonZero.hasNext()) {
>>> > >> > > >             break;
>>> > >> > > >           }
>>> > >> > > >           thisElement = thisNonZero.next();
>>> > >> > > >         }
>>> > >> > > >         if (advanceThat) {
>>> > >> > > >           if (!thatNonZero.hasNext()) {
>>> > >> > > >             break;
>>> > >> > > >           }
>>> > >> > > >           thatElement = thatNonZero.next();
>>> > >> > > >         }
>>> > >> > > >         if (thisElement.index() == thatElement.index()) {
>>> > >> > > >           result += thisElement.get() * thatElement.get();
>>> > >> > > >           advanceThis = true;
>>> > >> > > >           advanceThat = true;
>>> > >> > > >         } else if (thisElement.index() < thatElement.index()) {
>>> > >> > > >           advanceThis = true;
>>> > >> > > >           advanceThat = false;
>>> > >> > > >         } else {
>>> > >> > > >           advanceThis = false;
>>> > >> > > >           advanceThat = true;
>>> > >> > > >         }
>>> > >> > > >       }
>>> > >> > > >
>>> > >> > > >
>>> > >> > > > On Sat, Apr 13, 2013 at 1:47 AM, Ted Dunning <
>>> > ted.dunning@gmail.com
>>> > >> >
>>> > >> > > > wrote:
>>> > >> > > >
>>> > >> > > > > The caller is not at fault here.  The problem is that
>>> hasNext is
>>> > >> > > > advancing
>>> > >> > > > > the iterator due to a side effect.  The side effect is
>>> > impossible
>>> > >> to
>>> > >> > > > avoid
>>> > >> > > > > at the level of the caller.
>>> > >> > > > >
>>> > >> > > > > Sent from my iPhone
>>> > >> > > > >
>>> > >> > > > > On Apr 12, 2013, at 12:22, Sean Owen <sr...@gmail.com>
>>> wrote:
>>> > >> > > > >
>>> > >> > > > > > I'm sure I did (at least much of) the AbstractIterator
>>> change
>>> > so
>>> > >> > > blame
>>> > >> > > > > > me... but I think the pattern itself is just fine. It's
>>> used
>>> > in
>>> > >> > many
>>> > >> > > > > > places in the project. Reusing the value object is a big
>>> win
>>> > in
>>> > >> > some
>>> > >> > > > > > places. Allocating objects is fast but a trillion of them
>>> > still
>>> > >> > adds
>>> > >> > > > > > up.
>>> > >> > > > > >
>>> > >> > > > > > It does contain a requirement, and that is that the
>>> caller is
>>> > >> > > supposed
>>> > >> > > > > > to copy/clone the value if it will be used at all after
>>> the
>>> > next
>>> > >> > > > > > iterator operation. That's the 0th option, to just fix the
>>> > >> caller
>>> > >> > > > > > here.
>>> > >> > > > > >
>>> > >> > > > > > On Fri, Apr 12, 2013 at 7:49 PM, Ted Dunning <
>>> > >> > ted.dunning@gmail.com>
>>> > >> > > > > wrote:
>>> > >> > > > > >> The contract of computeNext is that there are no side
>>> effects
>>> > >> > > visible
>>> > >> > > > > >> outside (i.e. apparent functional style).  This is
>>> required
>>> > >> since
>>> > >> > > > > >> computeNext is called from hasNext().
>>> > >> > > > > >>
>>> > >> > > > > >> We are using a side-effecting style so we have a bug.
>>> > >> > > > > >>
>>> > >> > > > > >> We have two choices:
>>> > >> > > > > >>
>>> > >> > > > > >> a) use functional style. This will *require* that we
>>> > allocate a
>>> > >> > new
>>> > >> > > > > >> container element on every call to computeNext.  This is
>>> best
>>> > >> for
>>> > >> > > the
>>> > >> > > > > user
>>> > >> > > > > >> because they will have fewer surprising bugs due to
>>> reuse.
>>> >  If
>>> > >> > > > > allocation
>>> > >> > > > > >> is actually as bad as some people think (I remain
>>> skeptical
>>> > of
>>> > >> > that
>>> > >> > > > > without
>>> > >> > > > > >> tests) then this is a bad move.  If allocation of totally
>>> > >> > ephemeral
>>> > >> > > > > objects
>>> > >> > > > > >> is as cheap as I think, then this would be a good move.
>>> > >> > > > > >>
>>> > >> > > > > >> b) stop using AbstractIterator and continue with the
>>> re-use
>>> > >> style.
>>> > >> > > >  And
>>> > >> > > > > add
>>> > >> > > > > >> a comment to prevent a bright spark from reverting this
>>> > change.
>>> > >> >  (I
>>> > >> > > > > suspect
>>> > >> > > > > >> that the bright spark who did this in the first place
>>> was me
>>> > >> so I
>>> > >> > > can
>>> > >> > > > be
>>> > >> > > > > >> rude)
>>> > >> > > > >
>>> > >> > > >
>>> > >> > >
>>> > >> >
>>> > >>
>>> > >
>>> > >
>>> >
>>>
>>
>>
>

Re: Odd vector iteration behavior

Posted by Robin Anil <ro...@gmail.com>.
After fixing the iterator (assuming my patch is correct), the following
tests are failing in math because of incorrect usage of next() without
checking hasNext(). I would need some help fixing them, as I have never
touched that code before:

  SequentialBigSvdTest.testSingularValues:40->assertEquals:64 » NullPointer
  LogLikelihoodTest.testFrequencyComparison:108 » NullPointer
  TestHebbianSolver.testHebbianSolver:86->timeSolver:59 » NullPointer
  MultiNormalTest.testDiagonal:54 » NullPointer
  PermutedVectorViewTest.testIterators:60 » NullPointer
  WeightedVectorTest.testProjection:66 » NullPointer
  WeightedVectorTest>AbstractVectorTest.testSimpleOps:52 » NullPointer
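
As a generic illustration of the caller-side pattern (not a diagnosis of
these particular tests), here is a small sketch using only java.util
iterators. GuardedIteration, sum(), and firstOrDefault() are hypothetical
helper names. The idea is simply that every next() is preceded by a
hasNext() check, so an exhausted iterator is handled explicitly instead of
surfacing as an exception or a null later on.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;

/** Hypothetical helpers showing hasNext()-guarded use of an iterator. */
final class GuardedIteration {

  /** Returns the first value, or the fallback when the iterator is already exhausted. */
  static double firstOrDefault(Iterator<Double> it, double fallback) {
    return it.hasNext() ? it.next() : fallback;
  }

  /** Sums all values, checking hasNext() before every next(). */
  static double sum(Iterator<Double> it) {
    double total = 0.0;
    while (it.hasNext()) {
      total += it.next();
    }
    return total;
  }

  public static void main(String[] args) {
    System.out.println(sum(Arrays.asList(1.0, 2.0, 3.0).iterator()));
    System.out.println(firstOrDefault(Collections.<Double>emptyList().iterator(), -1.0));
  }
}
```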


See the iterator patch
https://reviews.apache.org/r/10455/diff/#index_header

Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.


On Sun, Apr 14, 2013 at 7:27 PM, Robin Anil <ro...@gmail.com> wrote:

> I am working on a patch. Here is a sample. This one is for RASV. I put
> Dan's example as a test case.
>
>
>
>   private final class NonDefaultIterator implements Iterator<Element> {
>     private final class NonDefaultElement implements Element {
>       @Override
>       public double get() {
>         return mapElement.get();
>       }
>
>       @Override
>       public int index() {
>         return mapElement.index();
>       }
>
>       @Override
>       public void set(double value) {
>         invalidateCachedLength();
>         mapElement.set(value);
>       }
>     }
>
>     private final NonDefaultElement element = new NonDefaultElement();
>     private final Iterator<MapElement> iterator;
>     private MapElement mapElement;
>
>     private NonDefaultIterator() {
>       this.iterator = values.iterator();
>     }
>
>     @Override
>     public boolean hasNext() {
>       return iterator.hasNext();
>     }
>
>     @Override
>     public Element next() {
>       mapElement = iterator.next();
>       return element;
>     }
>
>     @Override
>     public void remove() {
>       throw new UnsupportedOperationException();
>     }
>   }
>
>
>
> Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
>
>
> On Sun, Apr 14, 2013 at 7:04 PM, Ted Dunning <te...@gmail.com>wrote:
>
>> Well... current iterator style with a non-side-effecting version of
>> hasNext(), of course.
>>
>> Reusing the container is OK if the performance hit is substantial.
>>
>>
>> On Sun, Apr 14, 2013 at 5:02 PM, Robin Anil <ro...@gmail.com> wrote:
>>
>> > Also the Tests crash due to excessive GC. The performance degradation
>> there
>> > is very visible(my cpu spikes up). I think there is good case for the
>> > current iteration style, just that we have to, not use the java Iterator
>> > contract and confuse clients.
>> >
>> > Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
>> >
>> >
>> > On Sun, Apr 14, 2013 at 6:59 PM, Robin Anil <ro...@gmail.com>
>> wrote:
>> >
>> > > Yes. All final.
>> > >
>> > > Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
>> > >
>> > >
>> > > On Sun, Apr 14, 2013 at 6:55 PM, Ted Dunning <ted.dunning@gmail.com
>> > >wrote:
>> > >
>> > >> Did you mark the class and fields all as final?
>> > >>
>> > >> That might help the compiler realize it could in-line stuff and avoid
>> > the
>> > >> constructor (not likely, but possible)
>> > >>
>> > >>
>> > >> On Sun, Apr 14, 2013 at 4:52 PM, Robin Anil <ro...@gmail.com>
>> > wrote:
>> > >>
>> > >> > With a new immutable Element in the iterator, the iteration
>> behavior
>> > is
>> > >> > corrected but. There is a performance degradation of about 10% and
>> > >> > nullifies what I have done with the patch.
>> > >> >
>> > >> > See
>> > >> >
>> > >> >
>> > >>
>> >
>> https://docs.google.com/spreadsheet/ccc?key=0AhewTD_ZgznddGFQbWJCQTZXSnFULUYzdURfWDRJQlE#gid=1
>> > >> >
>> > >> > Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
>> > >> >
>> > >> >
>> > >> > On Sun, Apr 14, 2013 at 11:28 AM, Ted Dunning <
>> ted.dunning@gmail.com>
>> > >> > wrote:
>> > >> >
>> > >> > > Yeah... but we still have to fix the iterator.
>> > >> > >
>> > >> > >
>> > >> > > On Sun, Apr 14, 2013 at 8:58 AM, Robin Anil <
>> robin.anil@gmail.com>
>> > >> > wrote:
>> > >> > >
>> > >> > > > Here is an iteration style that works as is with today's
>> behaviour
>> > >> of
>> > >> > > > hasNext
>> > >> > > >
>> > >> > > >       Element thisElement = null;
>> > >> > > >       Element thatElement = null;
>> > >> > > >       boolean advanceThis = true;
>> > >> > > >       boolean advanceThat = true;
>> > >> > > >
>> > >> > > >       Iterator<Element> thisNonZero = this.iterateNonZero();
>> > >> > > >       Iterator<Element> thatNonZero = x.iterateNonZero();
>> > >> > > >
>> > >> > > >       double result = 0.0;
>> > >> > > >       while (true) {
>> > >> > > >         if (advanceThis) {
>> > >> > > >           if (!thisNonZero.hasNext()) {
>> > >> > > >             break;
>> > >> > > >           }
>> > >> > > >           thisElement = thisNonZero.next();
>> > >> > > >         }
>> > >> > > >         if (advanceThat) {
>> > >> > > >           if (!thatNonZero.hasNext()) {
>> > >> > > >             break;
>> > >> > > >           }
>> > >> > > >           thatElement = thatNonZero.next();
>> > >> > > >         }
>> > >> > > >         if (thisElement.index() == thatElement.index()) {
>> > >> > > >           result += thisElement.get() * thatElement.get();
>> > >> > > >           advanceThis = true;
>> > >> > > >           advanceThat = true;
>> > >> > > >         } else if (thisElement.index() < thatElement.index()) {
>> > >> > > >           advanceThis = true;
>> > >> > > >           advanceThat = false;
>> > >> > > >         } else {
>> > >> > > >           advanceThis = false;
>> > >> > > >           advanceThat = true;
>> > >> > > >         }
>> > >> > > >       }
>> > >> > > >
>> > >> > > >
>> > >> > > > On Sat, Apr 13, 2013 at 1:47 AM, Ted Dunning <
>> > ted.dunning@gmail.com
>> > >> >
>> > >> > > > wrote:
>> > >> > > >
>> > >> > > > > The caller is not at fault here.  The problem is that
>> hasNext is
>> > >> > > > advancing
>> > >> > > > > the iterator due to a side effect.  The side effect is
>> > impossible
>> > >> to
>> > >> > > > avoid
>> > >> > > > > at the level of the caller.
>> > >> > > > >
>> > >> > > > > Sent from my iPhone
>> > >> > > > >
>> > >> > > > > On Apr 12, 2013, at 12:22, Sean Owen <sr...@gmail.com>
>> wrote:
>> > >> > > > >
>> > >> > > > > > I'm sure I did (at least much of) the AbstractIterator
>> change
>> > so
>> > >> > > blame
>> > >> > > > > > me... but I think the pattern itself is just fine. It's
>> used
>> > in
>> > >> > many
>> > >> > > > > > places in the project. Reusing the value object is a big
>> win
>> > in
>> > >> > some
>> > >> > > > > > places. Allocating objects is fast but a trillion of them
>> > still
>> > >> > adds
>> > >> > > > > > up.
>> > >> > > > > >
>> > >> > > > > > It does contain a requirement, and that is that the caller
>> is
>> > >> > > supposed
>> > >> > > > > > to copy/clone the value if it will be used at all after the
>> > next
>> > >> > > > > > iterator operation. That's the 0th option, to just fix the
>> > >> caller
>> > >> > > > > > here.
>> > >> > > > > >
>> > >> > > > > > On Fri, Apr 12, 2013 at 7:49 PM, Ted Dunning <
>> > >> > ted.dunning@gmail.com>
>> > >> > > > > wrote:
>> > >> > > > > >> The contract of computeNext is that there are no side
>> effects
>> > >> > > visible
>> > >> > > > > >> outside (i.e. apparent functional style).  This is
>> required
>> > >> since
>> > >> > > > > >> computeNext is called from hasNext().
>> > >> > > > > >>
>> > >> > > > > >> We are using a side-effecting style so we have a bug.
>> > >> > > > > >>
>> > >> > > > > >> We have two choices:
>> > >> > > > > >>
>> > >> > > > > >> a) use functional style. This will *require* that we
>> > allocate a
>> > >> > new
>> > >> > > > > >> container element on every call to computeNext.  This is
>> best
>> > >> for
>> > >> > > the
>> > >> > > > > user
>> > >> > > > > >> because they will have fewer surprising bugs due to reuse.
>> >  If
>> > >> > > > > allocation
>> > >> > > > > >> is actually as bad as some people think (I remain
>> skeptical
>> > of
>> > >> > that
>> > >> > > > > without
>> > >> > > > > >> tests) then this is a bad move.  If allocation of totally
>> > >> > ephemeral
>> > >> > > > > objects
>> > >> > > > > >> is as cheap as I think, then this would be a good move.
>> > >> > > > > >>
>> > >> > > > > >> b) stop using AbstractIterator and continue with the
>> re-use
>> > >> style.
>> > >> > > >  And
>> > >> > > > > add
>> > >> > > > > >> a comment to prevent a bright spark from reverting this
>> > change.
>> > >> >  (I
>> > >> > > > > suspect
>> > >> > > > > >> that the bright spark who did this in the first place was
>> me
>> > >> so I
>> > >> > > can
>> > >> > > > be
>> > >> > > > > >> rude)
>> > >> > > > >
>> > >> > > >
>> > >> > >
>> > >> >
>> > >>
>> > >
>> > >
>> >
>>
>
>

Re: Odd vector iteration behavior

Posted by Robin Anil <ro...@gmail.com>.
I am working on a patch. Here is a sample. This one is for
RandomAccessSparseVector (RASV). I added Dan's example as a test case.



  private final class NonDefaultIterator implements Iterator<Element> {
    // Reused view: reads through to whichever MapElement next() last returned.
    private final class NonDefaultElement implements Element {
      @Override
      public double get() {
        return mapElement.get();
      }

      @Override
      public int index() {
        return mapElement.index();
      }

      @Override
      public void set(double value) {
        invalidateCachedLength();
        mapElement.set(value);
      }
    }

    private final NonDefaultElement element = new NonDefaultElement();
    private final Iterator<MapElement> iterator;
    private MapElement mapElement;

    private NonDefaultIterator() {
      this.iterator = values.iterator();
    }

    // No side effects: only next() advances the underlying iterator.
    @Override
    public boolean hasNext() {
      return iterator.hasNext();
    }

    @Override
    public Element next() {
      mapElement = iterator.next();
      return element;
    }

    @Override
    public void remove() {
      throw new UnsupportedOperationException();
    }
  }
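
For checking the behavior outside Mahout, here is a self-contained sketch of the same pattern. The names (ReusedViewDemo, Entry, Element, ReusingIterator) are hypothetical stand-ins, not the RASV classes. hasNext() only delegates, next() re-points one reused view, and the loop is Dan's test (call next() on even steps only).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in classes for illustration; none of this is Mahout code.
final class ReusedViewDemo {
  /** Read-only element view, loosely modeled on Vector.Element. */
  interface Element {
    int index();
    double get();
  }

  /** A stored (index, value) pair; plays the role of MapElement in the patch. */
  static final class Entry {
    final int index;
    final double value;
    Entry(int index, double value) {
      this.index = index;
      this.value = value;
    }
  }

  /** Reuses one view object and advances the delegate only inside next(). */
  static final class ReusingIterator implements Iterator<Element> {
    private final class View implements Element {
      @Override public int index() { return current.index; }
      @Override public double get() { return current.value; }
    }

    private final Element view = new View();
    private final Iterator<Entry> delegate;
    private Entry current;

    ReusingIterator(List<Entry> entries) {
      this.delegate = entries.iterator();
    }

    @Override public boolean hasNext() { return delegate.hasNext(); } // no side effects
    @Override public Element next() { current = delegate.next(); return view; }
    @Override public void remove() { throw new UnsupportedOperationException(); }
  }

  /** Dan's loop: call next() on even steps only, record the index seen on every step. */
  static List<Integer> indicesSeen() {
    List<Entry> entries = Arrays.asList(
        new Entry(0, 1), new Entry(2, 2), new Entry(4, 3), new Entry(6, 4));
    Iterator<Element> it = new ReusingIterator(entries);
    List<Integer> seen = new ArrayList<>();
    Element element = null;
    int i = 0;
    while (it.hasNext()) {
      if (i % 2 == 0) {
        element = it.next();
      }
      seen.add(element.index());
      ++i;
    }
    return seen;
  }

  public static void main(String[] args) {
    System.out.println(indicesSeen());
  }
}
```

With this arrangement the recorded indices come out as 0, 0, 2, 2, 4, 4, 6: the element changes only when next() is called, never as a result of hasNext().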



Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.


On Sun, Apr 14, 2013 at 7:04 PM, Ted Dunning <te...@gmail.com> wrote:

> Well... current iterator style with a non-side-effecting version of
> hasNext(), of course.
>
> Reusing the container is OK if the performance hit is substantial.

Re: Odd vector iteration behavior

Posted by Ted Dunning <te...@gmail.com>.
Well... current iterator style with a non-side-effecting version of
hasNext(), of course.

Reusing the container is OK if the performance hit is substantial.
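
For illustration, here is how the two-iterator merge quoted in this thread behaves once hasNext() has no side effects. This is a minimal sketch, not Mahout code: java.util.TreeMap is an assumed stand-in for an index-ordered sparse vector, and SparseDotSketch, of() and dot() are hypothetical names.

```java
import java.util.Iterator;
import java.util.Map;
import java.util.TreeMap;

// Sketch only: TreeMap is an assumed stand-in for an index-ordered sparse vector.
final class SparseDotSketch {
  /** Builds an index-sorted map from parallel arrays (hypothetical helper). */
  static TreeMap<Integer, Double> of(int[] indices, double[] values) {
    TreeMap<Integer, Double> m = new TreeMap<>();
    for (int i = 0; i < indices.length; i++) {
      m.put(indices[i], values[i]);
    }
    return m;
  }

  /** Dot product via a sorted merge, advancing only the iterator that is behind. */
  static double dot(TreeMap<Integer, Double> a, TreeMap<Integer, Double> b) {
    Iterator<Map.Entry<Integer, Double>> thisIt = a.entrySet().iterator();
    Iterator<Map.Entry<Integer, Double>> thatIt = b.entrySet().iterator();
    Map.Entry<Integer, Double> thisE = null;
    Map.Entry<Integer, Double> thatE = null;
    boolean advanceThis = true;
    boolean advanceThat = true;
    double result = 0.0;
    while (true) {
      // hasNext() is consulted only when we actually intend to advance.
      if (advanceThis) {
        if (!thisIt.hasNext()) {
          break;
        }
        thisE = thisIt.next();
      }
      if (advanceThat) {
        if (!thatIt.hasNext()) {
          break;
        }
        thatE = thatIt.next();
      }
      int cmp = Integer.compare(thisE.getKey(), thatE.getKey());
      if (cmp == 0) {
        result += thisE.getValue() * thatE.getValue();
        advanceThis = true;
        advanceThat = true;
      } else if (cmp < 0) {
        advanceThis = true;
        advanceThat = false;
      } else {
        advanceThis = false;
        advanceThat = true;
      }
    }
    return result;
  }

  public static void main(String[] args) {
    double d = dot(of(new int[] {0, 2, 4, 6}, new double[] {1, 2, 3, 4}),
                   of(new int[] {2, 3, 6}, new double[] {5, 1, 2}));
    System.out.println(d);
  }
}
```

In the example in main, the two vectors share indices 2 and 6, so the result is 2*5 + 4*2 = 18.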


On Sun, Apr 14, 2013 at 5:02 PM, Robin Anil <ro...@gmail.com> wrote:

> Also, the tests crash due to excessive GC. The performance degradation there
> is very visible (my CPU spikes). I think there is a good case for the
> current iteration style; we just have to avoid using the Java Iterator
> contract and confusing clients.

Re: Odd vector iteration behavior

Posted by Robin Anil <ro...@gmail.com>.
Also, the tests crash due to excessive GC. The performance degradation there
is very visible (my CPU spikes). I think there is a good case for the
current iteration style; we just have to avoid using the Java Iterator
contract and confusing clients.
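
The caveat with the reuse style is the one Sean raised elsewhere in the thread: a caller that keeps a returned element past the next call to next() has to copy what it needs. Below is a minimal sketch of that pitfall, with hypothetical names (ReuseCaveatDemo, Cell) and plain arrays standing in for a vector.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical illustration of the element-reuse style; not Mahout code.
final class ReuseCaveatDemo {
  /** Mutable holder that the iterator refills on every next(). */
  static final class Cell {
    int index;
    double value;
  }

  /** Iterates over parallel arrays, handing out the same Cell each time. */
  static Iterator<Cell> reusingIterator(final int[] indices, final double[] values) {
    return new Iterator<Cell>() {
      private final Cell cell = new Cell();
      private int pos = 0;

      @Override public boolean hasNext() { return pos < indices.length; }

      @Override public Cell next() {
        if (pos >= indices.length) {
          throw new NoSuchElementException();
        }
        cell.index = indices[pos];
        cell.value = values[pos];
        pos++;
        return cell;
      }

      @Override public void remove() { throw new UnsupportedOperationException(); }
    };
  }

  /** Pitfall: every stored reference aliases the one shared Cell. */
  static List<Integer> retainedWithoutCopy(int[] indices, double[] values) {
    List<Cell> kept = new ArrayList<>();
    for (Iterator<Cell> it = reusingIterator(indices, values); it.hasNext();) {
      kept.add(it.next());
    }
    List<Integer> out = new ArrayList<>();
    for (Cell c : kept) {
      out.add(c.index);
    }
    return out;
  }

  /** Safe: the index is copied out before next() is called again. */
  static List<Integer> retainedWithCopy(int[] indices, double[] values) {
    List<Integer> out = new ArrayList<>();
    for (Iterator<Cell> it = reusingIterator(indices, values); it.hasNext();) {
      out.add(it.next().index);
    }
    return out;
  }

  public static void main(String[] args) {
    int[] idx = {0, 2, 4, 6};
    double[] val = {1, 2, 3, 4};
    System.out.println(retainedWithoutCopy(idx, val));
    System.out.println(retainedWithCopy(idx, val));
  }
}
```

For indices 0, 2, 4, 6 the first method reports 6 four times, because all four list slots point at the same Cell; the second reports 0, 2, 4, 6.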

Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.


On Sun, Apr 14, 2013 at 6:59 PM, Robin Anil <ro...@gmail.com> wrote:

> Yes. All final.

Re: Odd vector iteration behavior

Posted by Robin Anil <ro...@gmail.com>.
Yes. All final.

Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.


On Sun, Apr 14, 2013 at 6:55 PM, Ted Dunning <te...@gmail.com> wrote:

> Did you mark the class and fields all as final?
>
> That might help the compiler realize it can inline the calls and avoid the
> constructor (not likely, but possible).

Re: Odd vector iteration behavior

Posted by Ted Dunning <te...@gmail.com>.
Did you mark the class and fields all as final?

That might help the compiler realize it can inline the calls and avoid the
constructor (not likely, but possible).
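
For concreteness, the allocate-per-next() variant under discussion looks roughly like the sketch below. The names (ImmutableElementDemo, Element, allocatingIterator) are hypothetical and this is not the code from the patch. Whether a given JVM manages to eliminate the allocation is not something the sketch demonstrates; it only shows the semantics.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical illustration of the "new immutable element per next()" style.
final class ImmutableElementDemo {
  /** Final class with final fields: a snapshot of one (index, value) pair. */
  static final class Element {
    private final int index;
    private final double value;

    Element(int index, double value) {
      this.index = index;
      this.value = value;
    }

    int index() { return index; }
    double get() { return value; }
  }

  /** Allocates a fresh Element on every next(); hasNext() has no side effects. */
  static Iterator<Element> allocatingIterator(final int[] indices, final double[] values) {
    return new Iterator<Element>() {
      private int pos = 0;

      @Override public boolean hasNext() { return pos < indices.length; }

      @Override public Element next() {
        if (pos >= indices.length) {
          throw new NoSuchElementException();
        }
        Element e = new Element(indices[pos], values[pos]);
        pos++;
        return e;
      }

      @Override public void remove() { throw new UnsupportedOperationException(); }
    };
  }

  /** Retained elements stay valid because each one is its own object. */
  static List<Integer> retainedIndices(int[] indices, double[] values) {
    List<Element> kept = new ArrayList<>();
    for (Iterator<Element> it = allocatingIterator(indices, values); it.hasNext();) {
      kept.add(it.next());
    }
    List<Integer> out = new ArrayList<>();
    for (Element e : kept) {
      out.add(e.index());
    }
    return out;
  }

  public static void main(String[] args) {
    System.out.println(retainedIndices(new int[] {0, 2, 4, 6}, new double[] {1, 2, 3, 4}));
  }
}
```

Here callers can hold on to elements freely (the retained indices are 0, 2, 4, 6), at the cost of one short-lived object per element visited.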


On Sun, Apr 14, 2013 at 4:52 PM, Robin Anil <ro...@gmail.com> wrote:

> With a new immutable Element in the iterator, the iteration behavior is
> corrected, but there is a performance degradation of about 10%, which
> nullifies what I have done with the patch.
>
> See
>
> https://docs.google.com/spreadsheet/ccc?key=0AhewTD_ZgznddGFQbWJCQTZXSnFULUYzdURfWDRJQlE#gid=1
>
> Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
>

Re: Odd vector iteration behavior

Posted by Robin Anil <ro...@gmail.com>.
With a new immutable Element in the iterator, the iteration behavior is
corrected, but there is a performance degradation of about 10%, which
nullifies what I have done with the patch.

See
https://docs.google.com/spreadsheet/ccc?key=0AhewTD_ZgznddGFQbWJCQTZXSnFULUYzdURfWDRJQlE#gid=1

Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.


On Sun, Apr 14, 2013 at 11:28 AM, Ted Dunning <te...@gmail.com> wrote:

> Yeah... but we still have to fix the iterator.
>

Re: Odd vector iteration behavior

Posted by Ted Dunning <te...@gmail.com>.
Yeah... but we still have to fix the iterator.


On Sun, Apr 14, 2013 at 8:58 AM, Robin Anil <ro...@gmail.com> wrote:

> Here is an iteration style that works as is with today's behaviour of
> hasNext
>
>     Element thisElement = null;
>     Element thatElement = null;
>     boolean advanceThis = true;
>     boolean advanceThat = true;
>
>     Iterator<Element> thisNonZero = this.iterateNonZero();
>     Iterator<Element> thatNonZero = x.iterateNonZero();
>
>     double result = 0.0;
>     while (true) {
>       if (advanceThis) {
>         if (!thisNonZero.hasNext()) {
>           break;
>         }
>         thisElement = thisNonZero.next();
>       }
>       if (advanceThat) {
>         if (!thatNonZero.hasNext()) {
>           break;
>         }
>         thatElement = thatNonZero.next();
>       }
>       if (thisElement.index() == thatElement.index()) {
>         result += thisElement.get() * thatElement.get();
>         advanceThis = true;
>         advanceThat = true;
>       } else if (thisElement.index() < thatElement.index()) {
>         advanceThis = true;
>         advanceThat = false;
>       } else {
>         advanceThis = false;
>         advanceThat = true;
>       }
>     }
>

Re: Odd vector iteration behavior

Posted by Robin Anil <ro...@gmail.com>.
Here is an iteration style that works as-is with today's behaviour of
hasNext:

    Element thisElement = null;
    Element thatElement = null;
    boolean advanceThis = true;
    boolean advanceThat = true;

    Iterator<Element> thisNonZero = this.iterateNonZero();
    Iterator<Element> thatNonZero = x.iterateNonZero();

    double result = 0.0;
    while (true) {
      // Call hasNext() only on an iterator that is about to advance, so the
      // element still held from the other iterator is not overwritten.
      if (advanceThis) {
        if (!thisNonZero.hasNext()) {
          break;
        }
        thisElement = thisNonZero.next();
      }
      if (advanceThat) {
        if (!thatNonZero.hasNext()) {
          break;
        }
        thatElement = thatNonZero.next();
      }
      if (thisElement.index() == thatElement.index()) {
        // Matching indices: accumulate the product and advance both sides.
        result += thisElement.get() * thatElement.get();
        advanceThis = true;
        advanceThat = true;
      } else if (thisElement.index() < thatElement.index()) {
        advanceThis = true;
        advanceThat = false;
      } else {
        advanceThis = false;
        advanceThat = true;
      }
    }


On Sat, Apr 13, 2013 at 1:47 AM, Ted Dunning <te...@gmail.com> wrote:

> The caller is not at fault here.  The problem is that hasNext is advancing
> the iterator due to a side effect.  The side effect is impossible to avoid
> at the level of the caller.

Re: Odd vector iteration behavior

Posted by Ted Dunning <te...@gmail.com>.
The caller is not at fault here.  The problem is that hasNext is advancing
the iterator due to a side effect.  The side effect is impossible to avoid
at the level of the caller.

On Apr 12, 2013, at 12:22, Sean Owen <sr...@gmail.com> wrote:

> I'm sure I did (at least much of) the AbstractIterator change so blame
> me... but I think the pattern itself is just fine. It's used in many
> places in the project. Reusing the value object is a big win in some
> places. Allocating objects is fast but a trillion of them still adds
> up.
> 
> It does contain a requirement, and that is that the caller is supposed
> to copy/clone the value if it will be used at all after the next
> iterator operation. That's the 0th option, to just fix the caller
> here.

Re: Odd vector iteration behavior

Posted by Sean Owen <sr...@gmail.com>.
I'm sure I did (at least much of) the AbstractIterator change so blame
me... but I think the pattern itself is just fine. It's used in many
places in the project. Reusing the value object is a big win in some
places. Allocating objects is fast but a trillion of them still adds
up.

It does contain a requirement, and that is that the caller is supposed
to copy/clone the value if it will be used at all after the next
iterator operation. That's the 0th option, to just fix the caller
here.
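
To make the copy requirement concrete, here is a small self-contained sketch. MutableElement, ReusingIterator and CopyingCaller are hypothetical stand-ins written for illustration; they only mimic the reuse-plus-look-ahead behaviour discussed in this thread and are not the Mahout or Guava classes.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical mutable element, standing in for the reused element object.
final class MutableElement {
  int index;
  double value;
}

// Hypothetical look-ahead iterator that reuses one element: hasNext()
// computes the next entry and overwrites the shared element as a side effect.
final class ReusingIterator implements Iterator<MutableElement> {
  private final int[] indices;
  private final double[] values;
  private final MutableElement shared = new MutableElement();
  private int next = 0;          // position of the next entry to load
  private boolean ready = false; // true while 'shared' holds an unconsumed entry

  ReusingIterator(int[] indices, double[] values) {
    this.indices = indices;
    this.values = values;
  }

  @Override
  public boolean hasNext() {
    if (!ready && next < indices.length) {
      shared.index = indices[next]; // visible to anyone holding 'shared'
      shared.value = values[next];
      next++;
      ready = true;
    }
    return ready;
  }

  @Override
  public MutableElement next() {
    if (!hasNext()) {
      throw new NoSuchElementException();
    }
    ready = false;
    return shared;
  }
}

final class CopyingCaller {
  // Returns {index read through the held reference, index copied after next()}.
  static int[] staleVersusCopied() {
    ReusingIterator it =
        new ReusingIterator(new int[] {52, 60}, new double[] {1.0, 2.0});
    MutableElement held = it.next();
    int copiedIndex = held.index; // copy before any further iterator call
    it.hasNext();                 // overwrites the shared element
    return new int[] {held.index, copiedIndex};
  }
}
```

The copied primitive keeps the value it had when next() returned, while the held reference reflects whatever the iterator last loaded.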

On Fri, Apr 12, 2013 at 7:49 PM, Ted Dunning <te...@gmail.com> wrote:
> The contract of computeNext is that there are no side effects visible
> outside (i.e. apparent functional style).  This is required since
> computeNext is called from hasNext().
>
> We are using a side-effecting style so we have a bug.
>
> We have two choices:
>
> a) use functional style. This will *require* that we allocate a new
> container element on every call to computeNext.  This is best for the user
> because they will have fewer surprising bugs due to reuse.  If allocation
> is actually as bad as some people think (I remain skeptical of that without
> tests) then this is a bad move.  If allocation of totally ephemeral objects
> is as cheap as I think, then this would be a good move.
>
> b) stop using AbstractIterator and continue with the re-use style.  And add
> a comment to prevent a bright spark from reverting this change.  (I suspect
> that the bright spark who did this in the first place was me so I can be
> rude)

Re: Odd vector iteration behavior

Posted by Ted Dunning <te...@gmail.com>.
The contract of computeNext is that there are no side effects visible
outside (i.e. apparent functional style).  This is required since
computeNext is called from hasNext().

We are using a side-effecting style so we have a bug.

We have two choices:

a) use functional style. This will *require* that we allocate a new
container element on every call to computeNext.  This is best for the user
because they will have fewer surprising bugs due to reuse.  If allocation
is actually as bad as some people think (I remain skeptical of that without
tests) then this is a bad move.  If allocation of totally ephemeral objects
is as cheap as I think, then this would be a good move.

b) stop using AbstractIterator and continue with the re-use style.  And add
a comment to prevent a bright spark from reverting this change.  (I suspect
that the bright spark who did this in the first place was me so I can be
rude)
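
A rough sketch of the two choices, under stated assumptions: Elem, AllocatingIterator, ReuseWithoutLookahead and OptionsDemo are hypothetical names, and the look-ahead logic in the first class is hand-written to imitate what an AbstractIterator-style base class does rather than using Guava itself. The second class assumes hasNext() can be answered with a plain position check, which holds when every stored entry is returned.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical element type shared by both sketches.
final class Elem {
  int index;
  double value;

  Elem(int index, double value) {
    this.index = index;
    this.value = value;
  }
}

// Option (a): look-ahead style with a fresh element per computed entry, so
// the look-ahead done by hasNext() never touches an element already handed out.
final class AllocatingIterator implements Iterator<Elem> {
  private final int[] indices;
  private final double[] values;
  private int pos = 0;
  private Elem lookahead; // computed by hasNext(), handed out by next()

  AllocatingIterator(int[] indices, double[] values) {
    this.indices = indices;
    this.values = values;
  }

  @Override
  public boolean hasNext() {
    if (lookahead == null && pos < indices.length) {
      lookahead = new Elem(indices[pos], values[pos]);
      pos++;
    }
    return lookahead != null;
  }

  @Override
  public Elem next() {
    if (!hasNext()) {
      throw new NoSuchElementException();
    }
    Elem result = lookahead;
    lookahead = null;
    return result;
  }
}

// Option (b): no look-ahead. hasNext() is a pure position check and only
// next() mutates the reused element.
final class ReuseWithoutLookahead implements Iterator<Elem> {
  private final int[] indices;
  private final double[] values;
  private final Elem shared = new Elem(-1, 0.0);
  private int pos = 0;

  ReuseWithoutLookahead(int[] indices, double[] values) {
    this.indices = indices;
    this.values = values;
  }

  @Override
  public boolean hasNext() {
    return pos < indices.length;
  }

  @Override
  public Elem next() {
    if (!hasNext()) {
      throw new NoSuchElementException();
    }
    shared.index = indices[pos];
    shared.value = values[pos];
    pos++;
    return shared;
  }
}

final class OptionsDemo {
  // Index seen through a held reference after an extra hasNext() call.
  static int indexAfterHasNext(Iterator<Elem> it) {
    Elem held = it.next();
    it.hasNext();
    return held.index;
  }
}
```

In both variants an element obtained from next() is unchanged by a later hasNext(). With (b) it is still overwritten by the following next(), so the reuse caveat remains for callers that keep references across next() calls.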




On Fri, Apr 12, 2013 at 11:05 AM, Jake Mannix <ja...@gmail.com> wrote:

> This looks very wrong.  The iterators for SASV extend guava's
> AbstractIterator, but they do reuse the NonDefaultElement instance
> internally.   It *looks* like we're correctly satisfying the
> AbstractIterator#computeNext() contract, but we must not be if we're
> mutating on multiple hasNext() calls...
>

Re: Odd vector iteration behavior

Posted by Jake Mannix <ja...@gmail.com>.
This looks very wrong.  The iterators for SequentialAccessSparseVector
(SASV) extend Guava's AbstractIterator, but they do reuse the
NonDefaultElement instance internally.  It *looks* like we're correctly
satisfying the AbstractIterator#computeNext() contract, but we must not be
if we're mutating on multiple hasNext() calls...



On Fri, Apr 12, 2013 at 9:36 AM, Dan Filimon <da...@gmail.com> wrote:

> While looking at the patch for fixing the sparse vectors (MAHOUT-1190), I
> started working with vector Iterators doing what I thought was reasonable.
>
> This is the important snippet:
> [...]
>         thisIterator = this.iterateNonZero();
>         thatIterator = other.iterateNonZero();
>         thisElement = thatElement = null;
>         boolean advanceThis = true;
>         boolean advanceThat = true;
>         OrderedIntDoubleMapping thisUpdates = new OrderedIntDoubleMapping();
>
>         while (thisIterator.hasNext() && thatIterator.hasNext()) {
>           if (advanceThis) {
>             thisElement = thisIterator.next();
>           }
>           if (advanceThat) {
>             thatElement = thatIterator.next();
>           }
> [... advanceThis and advanceThat are set to true based on which iterator to
> advance...]
>
> The problem here is that when calling next(), the iterator state gets
> invalidated and when calling hasNext() the iterator will be advanced
> accordingly and the element references will point to the next element
> (which is mutated).
>
> So, if the indices start at:
> 52 and 87
> despite wanting to only advance the 52, since both were accessed with
> next(), they are both modified.
>
> Here's another snippet with this behavior [1]:
>
>     Vector vector = new SequentialAccessSparseVector(100);
>     vector.set(0, 1);
>     vector.set(2, 2);
>     vector.set(4, 3);
>     vector.set(6, 4);
>     Iterator<Vector.Element> vectorIterator = vector.iterateNonZero();
>     Vector.Element element = null;
>     int i = 0;
>     while (vectorIterator.hasNext()) {
>       if (i % 2 == 0) {
>         element = vectorIterator.next();
>       }
>       System.out.printf("%d %d %f\n", i, element.index(), element.get());
>       ++i;
>     }
>
>
> The output is:
> 0 0 1.000000
> 1 2 2.000000
> 2 2 2.000000
> 3 4 3.000000
> 4 4 3.000000
> 5 6 4.000000
> 6 6 4.000000
>
> I expected it to be:
> 0 0 1.000000
> 1 0 1.000000
> 2 2 2.000000
> 3 2 2.000000
> 4 4 3.000000
> 5 4 3.000000
> 6 6 4.000000
>
> So, I'm completely wrong. Is this just me not understanding what an
> iterator is supposed to do?
>
> [1] https://gist.github.com/dfilimon/5373271
>



-- 

  -jake