You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@mahout.apache.org by "Shashikant Kore (JIRA)" <ji...@apache.org> on 2009/08/19 16:01:14 UTC

[jira] Created: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

Using better primitives hash for sparse vector for performance gains
--------------------------------------------------------------------

                 Key: MAHOUT-165
                 URL: https://issues.apache.org/jira/browse/MAHOUT-165
             Project: Mahout
          Issue Type: Improvement
          Components: Matrix
    Affects Versions: 0.2
            Reporter: Shashikant Kore
             Fix For: 0.2
         Attachments: mahout-165.patch

In SparseVector, we need primitives hash map for index and values. The present implementation of this hash map is not as efficient as some of the other implementations in non-Apache projects. 

In an experiment, I found that, for get/set operations, the primitive hash of  Colt performance an order of magnitude better than OrderedIntDoubleMapping. For iteration it is 2x slower, though. 

Using Colt in Sparsevector improved performance of canopy generation. For an experimental dataset, the current implementation takes 50 minutes. Using Colt, reduces this duration to 19-20 minutes. That's 60% reduction in the delay. 



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Re: [jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

Posted by Jake Mannix <ja...@gmail.com>.

On Wed, Sep 30, 2009 at 1:16 PM, Grant Ingersoll <gs...@apache.org>wrote:

>
> On Sep 30, 2009, at 4:03 PM, Jake Mannix wrote:
>
>  Regarding having equals() effectively delegate to
>> getName().equals(other.getName()) && equivalent(other) means that we need
>> to
>> be extra special careful about implementations of hashCode() :
>>
>> If we are not going to break the contract between equals() and hashCode(),
>> and we're having equals() *only* take into account the mathematical
>> contents
>> and the name, then I'd say what we need to do is implement hashCode() in a
>> top level class like AbstractVector.
>>
>
> That is what is happening.


It is on trunk, but not in Ted's patch, which is what I'm currently looking
at, and want
to make sure I'm adhering to convention as I play with Ted's impls.

  -jake

Re: [jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

Posted by Grant Ingersoll <gs...@apache.org>.

On Sep 30, 2009, at 4:03 PM, Jake Mannix wrote:

> Regarding having equals() effectively delegate to
> getName().equals(other.getName()) && equivalent(other) means that we  
> need to
> be extra special careful about implementations of hashCode() :
>
> If we are not going to break the contract between equals() and  
> hashCode(),
> and we're having equals() *only* take into account the mathematical  
> contents
> and the name, then I'd say what we need to do is implement hashCode 
> () in a
> top level class like AbstractVector.

That is what is happening.

>
> (Is something funny going on with JIRA?  Seems broken...)

Yes, there is something wrong.  Infra is aware of it.


>
>  -jake
>
> On Wed, Sep 30, 2009 at 10:01 AM, Sean Owen (JIRA) <ji...@apache.org>  
> wrote:
>
>>
>>   [
>> https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12760956#action_12760956]
>>
>> Sean Owen commented on MAHOUT-165:
>> ----------------------------------
>>
>> Are my conclusions sound then:
>>
>> We agree that equals() should be 'pretty strict'. The conventional  
>> Java
>> wisdom is that equals(), in fact, ought not return true for  
>> instances of
>> differing classes, unless you really know what you're doing. I  
>> guess we do.
>> :)
>>
>> If the idea behind equals() is "do class-specific stuff, otherwise,  
>> check
>> names, and use equivalent() then", then we don't need  
>> strictEquivalence() --
>> where's it used?
>>
>> (If I represented the logic correctly above -- is that as simple as  
>> we can
>> make it? seems a touch complex)
>>
>> I am not sure anything is 'broken' in practice here but I sense it  
>> could be
>> simpler.
>>
>>
>>> Using better primitives hash for sparse vector for performance gains
>>> --------------------------------------------------------------------
>>>
>>>                Key: MAHOUT-165
>>>                URL: https://issues.apache.org/jira/browse/MAHOUT-165
>>>            Project: Mahout
>>>         Issue Type: Improvement
>>>         Components: Matrix
>>>   Affects Versions: 0.2
>>>           Reporter: Shashikant Kore
>>>           Assignee: Grant Ingersoll
>>>            Fix For: 0.2
>>>
>>>        Attachments: colt.jar, mahout-165-trove.patch,  
>>> MAHOUT-165.patch,
>> mahout-165.patch
>>>
>>>
>>> In SparseVector, we need primitives hash map for index and values.  
>>> The
>> present implementation of this hash map is not as efficient as some  
>> of the
>> other implementations in non-Apache projects.
>>> In an experiment, I found that, for get/set operations, the  
>>> primitive
>> hash of  Colt performance an order of magnitude better than
>> OrderedIntDoubleMapping. For iteration it is 2x slower, though.
>>> Using Colt in Sparsevector improved performance of canopy  
>>> generation. For
>> an experimental dataset, the current implementation takes 50  
>> minutes. Using
>> Colt, reduces this duration to 19-20 minutes. That's 60% reduction  
>> in the
>> delay.
>>
>> --
>> This message is automatically generated by JIRA.
>> -
>> You can reply to this email to add a comment to the issue online.
>>
>>

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:
http://www.lucidimagination.com/search

Re: [jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

Posted by Sean Owen <sr...@gmail.com>.

(PS yeah that was my fault for misreading the original message.)

On Thu, Oct 1, 2009 at 11:39 AM, Grant Ingersoll <gs...@apache.org> wrote:
>
> On Sep 30, 2009, at 4:34 PM, Jake Mannix wrote:
>
>> I didn't say that equals() should ignore name, I said the opposite -
>> equals
>> and
>> hashCode() should *only* take into account the contents and the name, and
>> not
>> implementation (which means that hashCode() needs to stay in one place and
>> not get monkeyed with in subclasses.
>>
>
> Right, that is my understanding

Re: [jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

Posted by Grant Ingersoll <gs...@apache.org>.

On Sep 30, 2009, at 4:34 PM, Jake Mannix wrote:

> I didn't say that equals() should ignore name, I said the opposite -  
> equals
> and
> hashCode() should *only* take into account the contents and the  
> name, and
> not
> implementation (which means that hashCode() needs to stay in one  
> place and
> not get monkeyed with in subclasses.
>

Right, that is my understanding

> On Wed, Sep 30, 2009 at 1:18 PM, Sean Owen <sr...@gmail.com> wrote:
>
>> No I don't hear anyone wanting to make equals() ignore the name.
>> (Otherwise, hashCode() would have to ignore it as well.)
>>
>> JIRA also seems pretty laggy to me.
>>
>> On Wed, Sep 30, 2009 at 9:03 PM, Jake Mannix <ja...@gmail.com>
>> wrote:
>>> If we are not going to break the contract between equals() and
>> hashCode(),
>>> and we're having equals() *only* take into account the mathematical
>> contents
>>> and the name, then I'd say what we need to do is implement hashCode 
>>> () in
>> a
>>> top level class like AbstractVector.
>>>
>>> (Is something funny going on with JIRA?  Seems broken...)
>>

Re: [jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

Posted by Jake Mannix <ja...@gmail.com>.

I didn't say that equals() should ignore name, I said the opposite - equals
and
hashCode() should *only* take into account the contents and the name, and
not
implementation (which means that hashCode() needs to stay in one place and
not get monkeyed with in subclasses.

On Wed, Sep 30, 2009 at 1:18 PM, Sean Owen <sr...@gmail.com> wrote:

> No I don't hear anyone wanting to make equals() ignore the name.
> (Otherwise, hashCode() would have to ignore it as well.)
>
> JIRA also seems pretty laggy to me.
>
> On Wed, Sep 30, 2009 at 9:03 PM, Jake Mannix <ja...@gmail.com>
> wrote:
> > If we are not going to break the contract between equals() and
> hashCode(),
> > and we're having equals() *only* take into account the mathematical
> contents
> > and the name, then I'd say what we need to do is implement hashCode() in
> a
> > top level class like AbstractVector.
> >
> > (Is something funny going on with JIRA?  Seems broken...)
>

Re: [jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

Posted by Sean Owen <sr...@gmail.com>.

No I don't hear anyone wanting to make equals() ignore the name.
(Otherwise, hashCode() would have to ignore it as well.)

JIRA also seems pretty laggy to me.

On Wed, Sep 30, 2009 at 9:03 PM, Jake Mannix <ja...@gmail.com> wrote:
> If we are not going to break the contract between equals() and hashCode(),
> and we're having equals() *only* take into account the mathematical contents
> and the name, then I'd say what we need to do is implement hashCode() in a
> top level class like AbstractVector.
>
> (Is something funny going on with JIRA?  Seems broken...)

Re: [jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

Posted by Jake Mannix <ja...@gmail.com>.

Regarding having equals() effectively delegate to
getName().equals(other.getName()) && equivalent(other) means that we need to
be extra special careful about implementations of hashCode() :

If we are not going to break the contract between equals() and hashCode(),
and we're having equals() *only* take into account the mathematical contents
and the name, then I'd say what we need to do is implement hashCode() in a
top level class like AbstractVector.

(Is something funny going on with JIRA?  Seems broken...)

  -jake

On Wed, Sep 30, 2009 at 10:01 AM, Sean Owen (JIRA) <ji...@apache.org> wrote:

>
>    [
> https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12760956#action_12760956]
>
> Sean Owen commented on MAHOUT-165:
> ----------------------------------
>
> Are my conclusions sound then:
>
> We agree that equals() should be 'pretty strict'. The conventional Java
> wisdom is that equals(), in fact, ought not return true for instances of
> differing classes, unless you really know what you're doing. I guess we do.
> :)
>
> If the idea behind equals() is "do class-specific stuff, otherwise, check
> names, and use equivalent() then", then we don't need strictEquivalence() --
> where's it used?
>
> (If I represented the logic correctly above -- is that as simple as we can
> make it? seems a touch complex)
>
> I am not sure anything is 'broken' in practice here but I sense it could be
> simpler.
>
>
> > Using better primitives hash for sparse vector for performance gains
> > --------------------------------------------------------------------
> >
> >                 Key: MAHOUT-165
> >                 URL: https://issues.apache.org/jira/browse/MAHOUT-165
> >             Project: Mahout
> >          Issue Type: Improvement
> >          Components: Matrix
> >    Affects Versions: 0.2
> >            Reporter: Shashikant Kore
> >            Assignee: Grant Ingersoll
> >             Fix For: 0.2
> >
> >         Attachments: colt.jar, mahout-165-trove.patch, MAHOUT-165.patch,
> mahout-165.patch
> >
> >
> > In SparseVector, we need primitives hash map for index and values. The
> present implementation of this hash map is not as efficient as some of the
> other implementations in non-Apache projects.
> > In an experiment, I found that, for get/set operations, the primitive
> hash of  Colt performance an order of magnitude better than
> OrderedIntDoubleMapping. For iteration it is 2x slower, though.
> > Using Colt in Sparsevector improved performance of canopy generation. For
> an experimental dataset, the current implementation takes 50 minutes. Using
> Colt, reduces this duration to 19-20 minutes. That's 60% reduction in the
> delay.
>
> --
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
>
>

Re: [jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

Posted by Ted Dunning <te...@gmail.com>.

I have just written a replacement.  I will post a patch as soon as I get
some solid testing done.

On Sat, Aug 29, 2009 at 2:29 PM, Grant Ingersoll <gs...@apache.org>wrote:

> Right, Colt likely could be used depending on the package it comes from and
> as long as it doesn't have deps on the other packages.
>
> -Grant
>
>
> On Aug 29, 2009, at 2:22 PM, Ted Dunning wrote:
>
>  Trove is LGPL so we can't lift code.  Even linking can be tricky.
>>
>> On Fri, Aug 28, 2009 at 10:06 AM, Shashikant Kore (JIRA) <jira@apache.org
>> >wrote:
>>
>>
>>>  [
>>>
>>> https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12748904
>>> #action_12748904]
>>>
>>> Shashikant Kore commented on MAHOUT-165:
>>> ----------------------------------------
>>>
>>> I'm fine with copying relevant classes from Colt or Trove.
>>>
>>> Please let me know your library of choice. I will create the patch and
>>> upload.
>>>
>>>
>>>
>>>  Using better primitives hash for sparse vector for performance gains
>>>> --------------------------------------------------------------------
>>>>
>>>>               Key: MAHOUT-165
>>>>               URL: https://issues.apache.org/jira/browse/MAHOUT-165
>>>>           Project: Mahout
>>>>        Issue Type: Improvement
>>>>        Components: Matrix
>>>>  Affects Versions: 0.2
>>>>          Reporter: Shashikant Kore
>>>>           Fix For: 0.2
>>>>
>>>>       Attachments: mahout-165.patch
>>>>
>>>>
>>>> In SparseVector, we need primitives hash map for index and values. The
>>>>
>>> present implementation of this hash map is not as efficient as some of
>>> the
>>> other implementations in non-Apache projects.
>>>
>>>> In an experiment, I found that, for get/set operations, the primitive
>>>>
>>> hash of  Colt performance an order of magnitude better than
>>> OrderedIntDoubleMapping. For iteration it is 2x slower, though.
>>>
>>>> Using Colt in Sparsevector improved performance of canopy generation.
>>>> For
>>>>
>>> an experimental dataset, the current implementation takes 50 minutes.
>>> Using
>>> Colt, reduces this duration to 19-20 minutes. That's 60% reduction in the
>>> delay.
>>>
>>> --
>>> This message is automatically generated by JIRA.
>>> -
>>> You can reply to this email to add a comment to the issue online.
>>>
>>>
>>>
>>
>> --
>> Ted Dunning, CTO
>> DeepDyve
>>
>
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com/
>
> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using
> Solr/Lucene:
> http://www.lucidimagination.com/search
>
>


-- 
Ted Dunning, CTO
DeepDyve

Re: [jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

Posted by Grant Ingersoll <gs...@apache.org>.

Right, Colt likely could be used depending on the package it comes  
from and as long as it doesn't have deps on the other packages.

-Grant

On Aug 29, 2009, at 2:22 PM, Ted Dunning wrote:

> Trove is LGPL so we can't lift code.  Even linking can be tricky.
>
> On Fri, Aug 28, 2009 at 10:06 AM, Shashikant Kore (JIRA) <jira@apache.org 
> >wrote:
>
>>
>>   [
>> https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12748904 
>> #action_12748904]
>>
>> Shashikant Kore commented on MAHOUT-165:
>> ----------------------------------------
>>
>> I'm fine with copying relevant classes from Colt or Trove.
>>
>> Please let me know your library of choice. I will create the patch  
>> and
>> upload.
>>
>>
>>
>>> Using better primitives hash for sparse vector for performance gains
>>> --------------------------------------------------------------------
>>>
>>>                Key: MAHOUT-165
>>>                URL: https://issues.apache.org/jira/browse/MAHOUT-165
>>>            Project: Mahout
>>>         Issue Type: Improvement
>>>         Components: Matrix
>>>   Affects Versions: 0.2
>>>           Reporter: Shashikant Kore
>>>            Fix For: 0.2
>>>
>>>        Attachments: mahout-165.patch
>>>
>>>
>>> In SparseVector, we need primitives hash map for index and values.  
>>> The
>> present implementation of this hash map is not as efficient as some  
>> of the
>> other implementations in non-Apache projects.
>>> In an experiment, I found that, for get/set operations, the  
>>> primitive
>> hash of  Colt performance an order of magnitude better than
>> OrderedIntDoubleMapping. For iteration it is 2x slower, though.
>>> Using Colt in Sparsevector improved performance of canopy  
>>> generation. For
>> an experimental dataset, the current implementation takes 50  
>> minutes. Using
>> Colt, reduces this duration to 19-20 minutes. That's 60% reduction  
>> in the
>> delay.
>>
>> --
>> This message is automatically generated by JIRA.
>> -
>> You can reply to this email to add a comment to the issue online.
>>
>>
>
>
> -- 
> Ted Dunning, CTO
> DeepDyve

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:
http://www.lucidimagination.com/search

Re: [jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

Posted by Ted Dunning <te...@gmail.com>.

Trove is LGPL so we can't lift code.  Even linking can be tricky.

On Fri, Aug 28, 2009 at 10:06 AM, Shashikant Kore (JIRA) <ji...@apache.org>wrote:

>
>    [
> https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12748904#action_12748904]
>
> Shashikant Kore commented on MAHOUT-165:
> ----------------------------------------
>
> I'm fine with copying relevant classes from Colt or Trove.
>
> Please let me know your library of choice. I will create the patch and
> upload.
>
>
>
> > Using better primitives hash for sparse vector for performance gains
> > --------------------------------------------------------------------
> >
> >                 Key: MAHOUT-165
> >                 URL: https://issues.apache.org/jira/browse/MAHOUT-165
> >             Project: Mahout
> >          Issue Type: Improvement
> >          Components: Matrix
> >    Affects Versions: 0.2
> >            Reporter: Shashikant Kore
> >             Fix For: 0.2
> >
> >         Attachments: mahout-165.patch
> >
> >
> > In SparseVector, we need primitives hash map for index and values. The
> present implementation of this hash map is not as efficient as some of the
> other implementations in non-Apache projects.
> > In an experiment, I found that, for get/set operations, the primitive
> hash of  Colt performance an order of magnitude better than
> OrderedIntDoubleMapping. For iteration it is 2x slower, though.
> > Using Colt in Sparsevector improved performance of canopy generation. For
> an experimental dataset, the current implementation takes 50 minutes. Using
> Colt, reduces this duration to 19-20 minutes. That's 60% reduction in the
> delay.
>
> --
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
>
>


-- 
Ted Dunning, CTO
DeepDyve

Re: [jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

Posted by Ted Dunning <te...@gmail.com>.

Colt did a nice job of this.  Basically, their idea was to take various
general functional patterns and allow the functions to be plugged in.

Common patterns that are reasonable to include in such a framework include:

a) dot product as the aggregration of a pairwise function application
(normal dot product is this with aggregation = + and pairwise function = *)

b) element-wise transformation of all elements of a matrix or vector

c) aggregation of an elementwise transformation of a vector or matrix (sum,
frobenius norm, euclidean squared distance are examples of this)

c) row-wise or column-wise transformation of a matrix resulting in a list of
objects

d) row-wise or column-wise aggregation of a matrix resulting in a vector
(row sum or column sum is a good example here)

e) row-wise or column-wise aggregated outer products (normal matrix
multiplication is an example where the product is dot and aggregation is +)

Combining these operations with various view operators such as transpose,
sub-matrix and diagonal allow various other interesting operators to be
implemented.  Trace, for example, becomes
m.diagonal().aggregate(Functions.plus).

The result is a very powerful API that has enormous expressivity.

It is also, unfortunately, essentially opaque to most users so lots of
short-cut and convenience functions are important.

On Thu, Oct 1, 2009 at 10:53 AM, Jake Mannix <ja...@gmail.com> wrote:

> Of course, to do this right in Mahout, we'd probably need to have some way
> of
> telling the Vector instance what to use for its norm (so it knows whether
> to
>
> cache it's L^2 norm, L^p norm, or some other inner product applied to
> itself).
>
> Which gets me thinking: maybe we should have an InnerProduct interface,
> similar to DistanceMeasure, which defined how to compute dot().  As it is,
> we basically assume in our API that while you may want norm(int p) for
> various p, and you may want different DistanceMeasure impls of various
> types, we say that dot() is always the Euclidean dot().
>
> Should that be pluggable as well?
>

-- 
Ted Dunning, CTO
DeepDyve

Re: [jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

Posted by Jake Mannix <ja...@gmail.com>.

On Thu, Oct 1, 2009 at 10:10 AM, Ted Dunning <te...@gmail.com> wrote:

> Btw... the other think that the HashVector does better is inserts.  The
> sorted vector could do much better on average if it deferred sorting until
> an access or iteration was done.  Even iteration doesn't necessarily need
> sorting, but it could by seen as part of the contract.
>

Iteration doesn't *need* sorting, but I agree that it should be part of the
contract
because doing fast dot products of two OrderedIntDoublePair instances really
needs ordering, so you can just walk them both in parallel.

In decomposer, I ended up having an ImmutableSparseVector subclass which
did the sorting in the constructor (if it wasn't already sorted) and then
didn't
allow further inserts/deletes/modifications.

Speaking of other little niceties: caching the norm is something I found
pretty
useful for a vector, and keeping track of a boolean "isDirty" which lets you

know whether you've invalidated it and will need to recalculate on the next
call to norm().

Of course, to do this right in Mahout, we'd probably need to have some way
of
telling the Vector instance what to use for its norm (so it knows whether to

cache it's L^2 norm, L^p norm, or some other inner product applied to
itself).

Which gets me thinking: maybe we should have an InnerProduct interface,
similar to DistanceMeasure, which defined how to compute dot().  As it is,
we basically assume in our API that while you may want norm(int p) for
various p, and you may want different DistanceMeasure impls of various
types, we say that dot() is always the Euclidean dot().

Should that be pluggable as well?

Re: [jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

Posted by Ted Dunning <te...@gmail.com>.

Btw... the other think that the HashVector does better is inserts.  The
sorted vector could do much better on average if it deferred sorting until
an access or iteration was done.  Even iteration doesn't necessarily need
sorting, but it could by seen as part of the contract.

On Thu, Oct 1, 2009 at 9:36 AM, Jake Mannix <ja...@gmail.com> wrote:

> Yeah, I added the "trying to find..." part of the debug output because I
> couldn't
> figure out what IntDoubleHash was "impossibly confused" about.
>
> Unfortunately, seeing what it was confused about only confused *me* about
> why it was impossible.
>
> On Thu, Oct 1, 2009 at 9:12 AM, Ted Dunning <te...@gmail.com> wrote:
>
> > It indicates a bug in the code or in the writer's head.
> >
> > You are correct about the intent.  The default value (which should
> probably
> > just be 0) should be returned if the value is missing.
> >
> > On Thu, Oct 1, 2009 at 5:41 AM, Grant Ingersoll (JIRA) <jira@apache.org
> > >wrote:
> >
> > > I don't think this is "impossible confusion", it's just simply supposed
> > to
> > > return a 0 when it can't find the index, right?
> >
> >
> >
> >
> > --
> > Ted Dunning, CTO
> > DeepDyve
> >
>



-- 
Ted Dunning, CTO
DeepDyve

Re: [jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

Posted by Jake Mannix <ja...@gmail.com>.

Yeah, I added the "trying to find..." part of the debug output because I
couldn't
figure out what IntDoubleHash was "impossibly confused" about.

Unfortunately, seeing what it was confused about only confused *me* about
why it was impossible.

On Thu, Oct 1, 2009 at 9:12 AM, Ted Dunning <te...@gmail.com> wrote:

> It indicates a bug in the code or in the writer's head.
>
> You are correct about the intent.  The default value (which should probably
> just be 0) should be returned if the value is missing.
>
> On Thu, Oct 1, 2009 at 5:41 AM, Grant Ingersoll (JIRA) <jira@apache.org
> >wrote:
>
> > I don't think this is "impossible confusion", it's just simply supposed
> to
> > return a 0 when it can't find the index, right?
>
>
>
>
> --
> Ted Dunning, CTO
> DeepDyve
>

Re: [jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

Posted by Ted Dunning <te...@gmail.com>.

It indicates a bug in the code or in the writer's head.

You are correct about the intent.  The default value (which should probably
just be 0) should be returned if the value is missing.

On Thu, Oct 1, 2009 at 5:41 AM, Grant Ingersoll (JIRA) <ji...@apache.org>wrote:

> I don't think this is "impossible confusion", it's just simply supposed to
> return a 0 when it can't find the index, right?

-- 
Ted Dunning, CTO
DeepDyve

[jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

Posted by "Shashikant Kore (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12779406#action_12779406 ] 

Shashikant Kore commented on MAHOUT-165:
----------------------------------------

My patch is only for the changes to SparseVector. I haven't applied Drew's patch. If you wish to apply my patch on Drews patch, then you need to change the import statements from cern.colt.* to org.apache.mahout.colt.*



> Using better primitives hash for sparse vector for performance gains
> --------------------------------------------------------------------
>
>                 Key: MAHOUT-165
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-165
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>             Fix For: 0.3
>
>         Attachments: colt.jar, mahout-165-18nov.patch, mahout-165-trove.patch, MAHOUT-165-updated.patch, MAHOUT-165-with-colt-module.patch, MAHOUT-165-with-colt.patch, mahout-165.patch, MAHOUT-165.patch, mahout-165.patch
>
>
> In SparseVector, we need primitives hash map for index and values. The present implementation of this hash map is not as efficient as some of the other implementations in non-Apache projects. 
> In an experiment, I found that, for get/set operations, the primitive hash of  Colt performance an order of magnitude better than OrderedIntDoubleMapping. For iteration it is 2x slower, though. 
> Using Colt in Sparsevector improved performance of canopy generation. For an experimental dataset, the current implementation takes 50 minutes. Using Colt, reduces this duration to 19-20 minutes. That's 60% reduction in the delay. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

Posted by "Shashikant Kore (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12757017#action_12757017 ] 

Shashikant Kore commented on MAHOUT-165:
----------------------------------------

Colt handles the removal by explicitly keeping an array of states. It's an additional array of bytes. 



> Using better primitives hash for sparse vector for performance gains
> --------------------------------------------------------------------
>
>                 Key: MAHOUT-165
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-165
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>             Fix For: 0.2
>
>         Attachments: colt.jar, mahout-165-trove.patch, MAHOUT-165.patch, mahout-165.patch
>
>
> In SparseVector, we need primitives hash map for index and values. The present implementation of this hash map is not as efficient as some of the other implementations in non-Apache projects. 
> In an experiment, I found that, for get/set operations, the primitive hash of  Colt performance an order of magnitude better than OrderedIntDoubleMapping. For iteration it is 2x slower, though. 
> Using Colt in Sparsevector improved performance of canopy generation. For an experimental dataset, the current implementation takes 50 minutes. Using Colt, reduces this duration to 19-20 minutes. That's 60% reduction in the delay. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12745176#action_12745176 ] 

Ted Dunning commented on MAHOUT-165:
------------------------------------

This is an important result, although as Sean mentioned the current class is a place-holder so the results are not all that surprising.

On the other hand, Colt has license issues.

In contrast, MTJ does not, provides even higher performance and is in the process of being integrated into Commons Math.  That effort was being held up pending the Math 2.0 release, but 2.0 is out and so should be moving forward.  I will ping Luc about this.  He wants closer ties between Math and Mahout and I think that makes a world of sense.

> Using better primitives hash for sparse vector for performance gains
> --------------------------------------------------------------------
>
>                 Key: MAHOUT-165
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-165
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>             Fix For: 0.2
>
>         Attachments: mahout-165.patch
>
>
> In SparseVector, we need primitives hash map for index and values. The present implementation of this hash map is not as efficient as some of the other implementations in non-Apache projects. 
> In an experiment, I found that, for get/set operations, the primitive hash of  Colt performance an order of magnitude better than OrderedIntDoubleMapping. For iteration it is 2x slower, though. 
> Using Colt in Sparsevector improved performance of canopy generation. For an experimental dataset, the current implementation takes 50 minutes. Using Colt, reduces this duration to 19-20 minutes. That's 60% reduction in the delay. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12760902#action_12760902 ] 

Grant Ingersoll commented on MAHOUT-165:
----------------------------------------

There are some thoughts on equals, etc. in the archives and other JIRA issues. 

Here's what I recall:

1. We want DenseVectors and SparseVectors with the same names to be equal in the equals() sense.  The implementations of equals in SparseVector and DenseVector are the equivalent, AFAICT, as the implementations in equals(), but for #2:
2. We don't just defer to strictEquivalence b/c the thinking is that we can do much faster equals comparison if we know what type of vector it is, which is why SparseVector checks to see if "that" is a SparseVector, otherwise deferring to equivalent (since names have already been checked).  I haven't validated whether they truly are faster.


> Using better primitives hash for sparse vector for performance gains
> --------------------------------------------------------------------
>
>                 Key: MAHOUT-165
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-165
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>             Fix For: 0.2
>
>         Attachments: colt.jar, mahout-165-trove.patch, MAHOUT-165.patch, mahout-165.patch
>
>
> In SparseVector, we need primitives hash map for index and values. The present implementation of this hash map is not as efficient as some of the other implementations in non-Apache projects. 
> In an experiment, I found that, for get/set operations, the primitive hash of  Colt performance an order of magnitude better than OrderedIntDoubleMapping. For iteration it is 2x slower, though. 
> Using Colt in Sparsevector improved performance of canopy generation. For an experimental dataset, the current implementation takes 50 minutes. Using Colt, reduces this duration to 19-20 minutes. That's 60% reduction in the delay. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12761000#action_12761000 ] 

Ted Dunning commented on MAHOUT-165:
------------------------------------


I will take a quick look this evening at the patch and see if I can bring it back to life.

> Using better primitives hash for sparse vector for performance gains
> --------------------------------------------------------------------
>
>                 Key: MAHOUT-165
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-165
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>             Fix For: 0.2
>
>         Attachments: colt.jar, mahout-165-trove.patch, MAHOUT-165.patch, mahout-165.patch
>
>
> In SparseVector, we need primitives hash map for index and values. The present implementation of this hash map is not as efficient as some of the other implementations in non-Apache projects. 
> In an experiment, I found that, for get/set operations, the primitive hash of  Colt performance an order of magnitude better than OrderedIntDoubleMapping. For iteration it is 2x slower, though. 
> Using Colt in Sparsevector improved performance of canopy generation. For an experimental dataset, the current implementation takes 50 minutes. Using Colt, reduces this duration to 19-20 minutes. That's 60% reduction in the delay. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12760893#action_12760893 ] 

Sean Owen commented on MAHOUT-165:
----------------------------------

It's a good, important question and one I think needs attention too. (I think there used to be even a 4th such method!)

equals() is tricky here since if you define equals() to include only the numeric values in the vector (which is reasonable) then you have some transitivity problems. A named vector may equals some non-named Vector implementation may equal some other named vector with different name, which implies the two named vectors with different names are equal.

I believe I and conventional wisdom favor defining equals() more strictly then to avoid surprises. Therefore I would implement equals() as a call to strictEquivalence() -- in fact, would remove that method and make it the implementation of equals(). (In fact, equals() *must* take account of name or else it is inconsistent with hashCode()).

equivalent() then stands on its own as a usefully different method, but I don't think it necessarily should be static, for consistency. Also, equals() can then use equivalent() to check that aspect of equality rather than duplicating it.

I think if we take these steps, IMHO this aspect makes much more sense and is more correct.

> Using better primitives hash for sparse vector for performance gains
> --------------------------------------------------------------------
>
>                 Key: MAHOUT-165
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-165
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>             Fix For: 0.2
>
>         Attachments: colt.jar, mahout-165-trove.patch, MAHOUT-165.patch, mahout-165.patch
>
>
> In SparseVector, we need primitives hash map for index and values. The present implementation of this hash map is not as efficient as some of the other implementations in non-Apache projects. 
> In an experiment, I found that, for get/set operations, the primitive hash of  Colt performance an order of magnitude better than OrderedIntDoubleMapping. For iteration it is 2x slower, though. 
> Using Colt in Sparsevector improved performance of canopy generation. For an experimental dataset, the current implementation takes 50 minutes. Using Colt, reduces this duration to 19-20 minutes. That's 60% reduction in the delay. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

Posted by "Shashikant Kore (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12762167#action_12762167 ] 

Shashikant Kore commented on MAHOUT-165:
----------------------------------------

I am trying out this patch. Somehow, I find it extremely slow compared to colt. 

I am running KMeans on 100k test vectors. With colt, all 5 iterations and clustering finished in 8 minutes. With this patch, it's been an hour and it hasn't even completed 50% of Iteration 1. (Iteration 0 was completed in 3 minutes.)   I checked kmeans.Cluster.java and verified that correct method on distance measure is called (one with 3 parameters). It is correct, and it can be verified by the quick completion of Iteration 0.

I am not not able to understand this behaviour. 



> Using better primitives hash for sparse vector for performance gains
> --------------------------------------------------------------------
>
>                 Key: MAHOUT-165
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-165
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>             Fix For: 0.2
>
>         Attachments: colt.jar, mahout-165-trove.patch, MAHOUT-165-updated.patch, mahout-165.patch, MAHOUT-165.patch, mahout-165.patch
>
>
> In SparseVector, we need primitives hash map for index and values. The present implementation of this hash map is not as efficient as some of the other implementations in non-Apache projects. 
> In an experiment, I found that, for get/set operations, the primitive hash of  Colt performance an order of magnitude better than OrderedIntDoubleMapping. For iteration it is 2x slower, though. 
> Using Colt in Sparsevector improved performance of canopy generation. For an experimental dataset, the current implementation takes 50 minutes. Using Colt, reduces this duration to 19-20 minutes. That's 60% reduction in the delay. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12754586#action_12754586 ] 

Grant Ingersoll commented on MAHOUT-165:
----------------------------------------

Ted, can you bring your patch up to date with trunk?

> Using better primitives hash for sparse vector for performance gains
> --------------------------------------------------------------------
>
>                 Key: MAHOUT-165
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-165
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>             Fix For: 0.2
>
>         Attachments: colt.jar, mahout-165-trove.patch, MAHOUT-165.patch, mahout-165.patch
>
>
> In SparseVector, we need primitives hash map for index and values. The present implementation of this hash map is not as efficient as some of the other implementations in non-Apache projects. 
> In an experiment, I found that, for get/set operations, the primitive hash of  Colt performance an order of magnitude better than OrderedIntDoubleMapping. For iteration it is 2x slower, though. 
> Using Colt in Sparsevector improved performance of canopy generation. For an experimental dataset, the current implementation takes 50 minutes. Using Colt, reduces this duration to 19-20 minutes. That's 60% reduction in the delay. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

Posted by "Shashikant Kore (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12753583#action_12753583 ] 

Shashikant Kore commented on MAHOUT-165:
----------------------------------------

The attached patch uses integer to double map. I was wondering if precision of float is good enough for us. That reduces the memory consumption by one-third.  (int+float instead of int+double)


> Using better primitives hash for sparse vector for performance gains
> --------------------------------------------------------------------
>
>                 Key: MAHOUT-165
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-165
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>             Fix For: 0.2
>
>         Attachments: colt.jar, mahout-165-trove.patch, MAHOUT-165.patch, mahout-165.patch
>
>
> In SparseVector, we need primitives hash map for index and values. The present implementation of this hash map is not as efficient as some of the other implementations in non-Apache projects. 
> In an experiment, I found that, for get/set operations, the primitive hash of  Colt performance an order of magnitude better than OrderedIntDoubleMapping. For iteration it is 2x slower, though. 
> Using Colt in Sparsevector improved performance of canopy generation. For an experimental dataset, the current implementation takes 50 minutes. Using Colt, reduces this duration to 19-20 minutes. That's 60% reduction in the delay. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll updated MAHOUT-165:
-----------------------------------

    Attachment: mahout-165.patch

This gets the VectorTest testEquals to pass.  Also fixes an instanceof check in the equals of OrderedIntDoubleVector.

Still a failure in VectorTest testHashCode due to the fact that the HashVector doesn't gracefully handle missing index items.

> Using better primitives hash for sparse vector for performance gains
> --------------------------------------------------------------------
>
>                 Key: MAHOUT-165
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-165
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>             Fix For: 0.2
>
>         Attachments: colt.jar, mahout-165-trove.patch, MAHOUT-165-updated.patch, mahout-165.patch, MAHOUT-165.patch, mahout-165.patch
>
>
> In SparseVector, we need primitives hash map for index and values. The present implementation of this hash map is not as efficient as some of the other implementations in non-Apache projects. 
> In an experiment, I found that, for get/set operations, the primitive hash of  Colt performance an order of magnitude better than OrderedIntDoubleMapping. For iteration it is 2x slower, though. 
> Using Colt in Sparsevector improved performance of canopy generation. For an experimental dataset, the current implementation takes 50 minutes. Using Colt, reduces this duration to 19-20 minutes. That's 60% reduction in the delay. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12761021#action_12761021 ] 

Ted Dunning commented on MAHOUT-165:
------------------------------------


THanks Jake, that could be very helpful.

The throwing of "Impossible confusion" is done in situations where an impossible condition has been detected.  For instance, since hash tables are resized when they become partially filled, it should be impossible for the search loop to exit without finding an empty cell or a match.  When programming, I have difficulty pronouncing "should" so I try to detect the situation and signal it with an unchecked exception.  I usually define something like "ImpossibleConditionException", but didn't in this case.  I use an unchecked exception because it is clear that the application is not going to be much able to recover from a situation that I don't think could occur.

I left the hard-coding of one option or the other in place because I could see my patch extending into everything everywhere and wanted to limit the scope of the change.  You are right that we need to think about how that works.  In most cases, I think that hard-coding is fine just like hard-coding the use of an ArrayList in some application is not subject to user over-ride.  There are a few cases where this isn't try, but I think that usually that means that the vector or matrix should be passed in.  The use of like() may also be indicated.

> Using better primitives hash for sparse vector for performance gains
> --------------------------------------------------------------------
>
>                 Key: MAHOUT-165
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-165
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>             Fix For: 0.2
>
>         Attachments: colt.jar, mahout-165-trove.patch, MAHOUT-165-updated.patch, MAHOUT-165.patch, mahout-165.patch
>
>
> In SparseVector, we need primitives hash map for index and values. The present implementation of this hash map is not as efficient as some of the other implementations in non-Apache projects. 
> In an experiment, I found that, for get/set operations, the primitive hash of  Colt performance an order of magnitude better than OrderedIntDoubleMapping. For iteration it is 2x slower, though. 
> Using Colt in Sparsevector improved performance of canopy generation. For an experimental dataset, the current implementation takes 50 minutes. Using Colt, reduces this duration to 19-20 minutes. That's 60% reduction in the delay. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

Posted by "Drew Farris (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12779451#action_12779451 ] 

Drew Farris commented on MAHOUT-165:
------------------------------------

{quote}
Awesome, thanks Drew. I noticed you didn't add a

{noformat}
<module>colt</module>
{noformat}

line inside of the top level pom, is this to hide it from being depended on? I just ask because it meant that IntelliJ didn't seem to want to consider the mahout-colt submodule to be a real maven submodule without that there.
{quote}

Jake, the module definition for colt is defined in the colt profile which starts on line 111. This way, we won't build colt unless the colt profile is activated using the -Pcolt argument on the command-line. I haven't switched to IntelliJ yet, but I would be surprised if there were not some option to activate certain profiles in the maven project configuration somewhere. 

> Using better primitives hash for sparse vector for performance gains
> --------------------------------------------------------------------
>
>                 Key: MAHOUT-165
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-165
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>             Fix For: 0.3
>
>         Attachments: colt.jar, mahout-165-18nov-updated.patch, mahout-165-18nov.patch, mahout-165-trove.patch, MAHOUT-165-updated.patch, MAHOUT-165-with-colt-module.patch, MAHOUT-165-with-colt.patch, mahout-165.patch, MAHOUT-165.patch, mahout-165.patch
>
>
> In SparseVector, we need primitives hash map for index and values. The present implementation of this hash map is not as efficient as some of the other implementations in non-Apache projects. 
> In an experiment, I found that, for get/set operations, the primitive hash of  Colt performance an order of magnitude better than OrderedIntDoubleMapping. For iteration it is 2x slower, though. 
> Using Colt in Sparsevector improved performance of canopy generation. For an experimental dataset, the current implementation takes 50 minutes. Using Colt, reduces this duration to 19-20 minutes. That's 60% reduction in the delay. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781436#action_12781436 ] 

Grant Ingersoll commented on MAHOUT-165:
----------------------------------------

OK, I moved over the matrix module, but there still needs to be some refactoring done there, as there are currently two layers of matrix packages.

> Using better primitives hash for sparse vector for performance gains
> --------------------------------------------------------------------
>
>                 Key: MAHOUT-165
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-165
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>             Fix For: 0.3
>
>         Attachments: colt.jar, mahout-165-18nov-updated.patch, mahout-165-18nov.patch, MAHOUT-165-colt.patch, mahout-165-trove.patch, MAHOUT-165-updated.patch, MAHOUT-165-with-colt-module.patch, MAHOUT-165-with-colt.patch, mahout-165.patch, MAHOUT-165.patch, mahout-165.patch
>
>
> In SparseVector, we need primitives hash map for index and values. The present implementation of this hash map is not as efficient as some of the other implementations in non-Apache projects. 
> In an experiment, I found that, for get/set operations, the primitive hash of  Colt performance an order of magnitude better than OrderedIntDoubleMapping. For iteration it is 2x slower, though. 
> Using Colt in Sparsevector improved performance of canopy generation. For an experimental dataset, the current implementation takes 50 minutes. Using Colt, reduces this duration to 19-20 minutes. That's 60% reduction in the delay. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12778831#action_12778831 ] 

Grant Ingersoll commented on MAHOUT-165:
----------------------------------------

Yep, I think we are all agreed on Colt.  I'll get 0.2 out and then we can add it.

> Using better primitives hash for sparse vector for performance gains
> --------------------------------------------------------------------
>
>                 Key: MAHOUT-165
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-165
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>             Fix For: 0.3
>
>         Attachments: colt.jar, mahout-165-trove.patch, MAHOUT-165-updated.patch, mahout-165.patch, MAHOUT-165.patch, mahout-165.patch
>
>
> In SparseVector, we need primitives hash map for index and values. The present implementation of this hash map is not as efficient as some of the other implementations in non-Apache projects. 
> In an experiment, I found that, for get/set operations, the primitive hash of  Colt performance an order of magnitude better than OrderedIntDoubleMapping. For iteration it is 2x slower, though. 
> Using Colt in Sparsevector improved performance of canopy generation. For an experimental dataset, the current implementation takes 50 minutes. Using Colt, reduces this duration to 19-20 minutes. That's 60% reduction in the delay. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

Posted by "Jake Mannix (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12779036#action_12779036 ] 

Jake Mannix commented on MAHOUT-165:
------------------------------------

And continuing on the topic of code adoption: Colt has no unit tests.   *eep*

So we might need to draft up a nice effort towards writing a bunch, once we get this incorporated.

I guess we can do this piece-by-piece - include the code first, but don't change any implementation to use Colt.  Then as we hook it into things, add more unit tests and verify that they work, and gradually get more test coverage over the adopted code.  We'd need to label it in the documentation as "no lifeguard present: for expert use only" until it was better tested in this case, because once it was available, people using Mahout could conceivably just start coding using Colt primitives even if we aren't directly linking to them at first in our own classes.

> Using better primitives hash for sparse vector for performance gains
> --------------------------------------------------------------------
>
>                 Key: MAHOUT-165
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-165
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>             Fix For: 0.3
>
>         Attachments: colt.jar, mahout-165-trove.patch, MAHOUT-165-updated.patch, mahout-165.patch, MAHOUT-165.patch, mahout-165.patch
>
>
> In SparseVector, we need primitives hash map for index and values. The present implementation of this hash map is not as efficient as some of the other implementations in non-Apache projects. 
> In an experiment, I found that, for get/set operations, the primitive hash of  Colt performance an order of magnitude better than OrderedIntDoubleMapping. For iteration it is 2x slower, though. 
> Using Colt in Sparsevector improved performance of canopy generation. For an experimental dataset, the current implementation takes 50 minutes. Using Colt, reduces this duration to 19-20 minutes. That's 60% reduction in the delay. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

Posted by "Jake Mannix (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12779132#action_12779132 ] 

Jake Mannix commented on MAHOUT-165:
------------------------------------

bq. I also like the idea of getting it donated to Apache, since it is cleanest and consistent with the intent here - to move it forward in an Apache project.

Well, I'm not sure how much of the "making a whole ASF project" overhead is necessary just *yet* (given how much work goes into a new project), but at least having it live someplace like google-code would give it that option.

> Using better primitives hash for sparse vector for performance gains
> --------------------------------------------------------------------
>
>                 Key: MAHOUT-165
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-165
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>             Fix For: 0.3
>
>         Attachments: colt.jar, mahout-165-trove.patch, MAHOUT-165-updated.patch, mahout-165.patch, MAHOUT-165.patch, mahout-165.patch
>
>
> In SparseVector, we need primitives hash map for index and values. The present implementation of this hash map is not as efficient as some of the other implementations in non-Apache projects. 
> In an experiment, I found that, for get/set operations, the primitive hash of  Colt performance an order of magnitude better than OrderedIntDoubleMapping. For iteration it is 2x slower, though. 
> Using Colt in Sparsevector improved performance of canopy generation. For an experimental dataset, the current implementation takes 50 minutes. Using Colt, reduces this duration to 19-20 minutes. That's 60% reduction in the delay. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12779058#action_12779058 ] 

Ted Dunning commented on MAHOUT-165:
------------------------------------

bq.    The colt tree could also be put into a separate module that lives alongside core, util, examples, built independently as a part of the maven build - optionally at first, activated via a build profile.

bq. +1 - I like this.

+1 as well.

bq.    As far as package names, would it be better to map cern.colt.* to org.apache.mahout.colt.* ? - that way there's no potential for the old being confused for the new in builds, etc.

bq. I personally think this is the way to go, but does it reduce confusion, or increase it? People who are used to using colt will see familiar classes, but in strange places. If we're really going to overhaul the whole library over time, this makes sense, I guess.

Absolutely.

Should all code be marked deprecated until it has unit tests with a comment to say why it is deprecated?  That would give us a clear visual signal that the lifeguard has left the pool.


> Using better primitives hash for sparse vector for performance gains
> --------------------------------------------------------------------
>
>                 Key: MAHOUT-165
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-165
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>             Fix For: 0.3
>
>         Attachments: colt.jar, mahout-165-trove.patch, MAHOUT-165-updated.patch, mahout-165.patch, MAHOUT-165.patch, mahout-165.patch
>
>
> In SparseVector, we need primitives hash map for index and values. The present implementation of this hash map is not as efficient as some of the other implementations in non-Apache projects. 
> In an experiment, I found that, for get/set operations, the primitive hash of  Colt performance an order of magnitude better than OrderedIntDoubleMapping. For iteration it is 2x slower, though. 
> Using Colt in Sparsevector improved performance of canopy generation. For an experimental dataset, the current implementation takes 50 minutes. Using Colt, reduces this duration to 19-20 minutes. That's 60% reduction in the delay. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12761201#action_12761201 ] 

Grant Ingersoll commented on MAHOUT-165:
----------------------------------------

The exception in the test is:
{quote}
java.lang.RuntimeException: Impossible confusion in IntDoubleHash: trying to find index 1 in (2,2.0), (0,1.0)
	at org.apache.mahout.matrix.IntDoubleHash.find(IntDoubleHash.java:162)
	at org.apache.mahout.matrix.IntDoubleHash.getInternal(IntDoubleHash.java:194)
	at org.apache.mahout.matrix.IntDoubleHash.get(IntDoubleHash.java:253)
	at org.apache.mahout.matrix.HashVector.equivalent(HashVector.java:219)
	at org.apache.mahout.matrix.AbstractVector.equals(AbstractVector.java:426)
	at org.apache.mahout.matrix.VectorTest.testHashCode(VectorTest.java:448)
{quote}

I don't think this is "impossible confusion", it's just simply supposed to return a 0 when it can't find the index, right?

> Using better primitives hash for sparse vector for performance gains
> --------------------------------------------------------------------
>
>                 Key: MAHOUT-165
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-165
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>             Fix For: 0.2
>
>         Attachments: colt.jar, mahout-165-trove.patch, MAHOUT-165-updated.patch, mahout-165.patch, MAHOUT-165.patch, mahout-165.patch
>
>
> In SparseVector, we need primitives hash map for index and values. The present implementation of this hash map is not as efficient as some of the other implementations in non-Apache projects. 
> In an experiment, I found that, for get/set operations, the primitive hash of  Colt performance an order of magnitude better than OrderedIntDoubleMapping. For iteration it is 2x slower, though. 
> Using Colt in Sparsevector improved performance of canopy generation. For an experimental dataset, the current implementation takes 50 minutes. Using Colt, reduces this duration to 19-20 minutes. That's 60% reduction in the delay. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

Posted by "Shashikant Kore (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12778822#action_12778822 ] 

Shashikant Kore commented on MAHOUT-165:
----------------------------------------

Not sure if voting for my own patch qualifies for good behavior. 

The Colt patch has extremely localized changes - only one class to be precise.  If tomorrow we want to move to something better, Colt has rather low exit barrier. Just rip off the SparseVector code and you are done.


> Using better primitives hash for sparse vector for performance gains
> --------------------------------------------------------------------
>
>                 Key: MAHOUT-165
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-165
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>             Fix For: 0.3
>
>         Attachments: colt.jar, mahout-165-trove.patch, MAHOUT-165-updated.patch, mahout-165.patch, MAHOUT-165.patch, mahout-165.patch
>
>
> In SparseVector, we need primitives hash map for index and values. The present implementation of this hash map is not as efficient as some of the other implementations in non-Apache projects. 
> In an experiment, I found that, for get/set operations, the primitive hash of  Colt performance an order of magnitude better than OrderedIntDoubleMapping. For iteration it is 2x slower, though. 
> Using Colt in Sparsevector improved performance of canopy generation. For an experimental dataset, the current implementation takes 50 minutes. Using Colt, reduces this duration to 19-20 minutes. That's 60% reduction in the delay. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

Posted by "Jake Mannix (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12761018#action_12761018 ] 

Jake Mannix commented on MAHOUT-165:
------------------------------------

Ted, some notes on your patch: 

  * with the two different specialized subclasses of SparseVector (HashVector, optimized for random access, and OrderedIntDoubleVector, optimized for iteration speed) being created, it seems like utilities like the TFDFMapper and so forth should be able to chose which impl to use, instead of getting hardcoded to use on or the other.

  * also, your current implementation of IntDoubleHash appears to sometimes throw "java.lang.RuntimeException: Impossible confusion in IntDoubleHash" exceptions sometimes, which sounds troubling. :)   

Attached is my current attempt at reviving this patch.  Currently has failing tests.  If it's easier to just svn up your own version, feel free to ignore, but I thought applying one which already compiles might help a little.

> Using better primitives hash for sparse vector for performance gains
> --------------------------------------------------------------------
>
>                 Key: MAHOUT-165
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-165
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>             Fix For: 0.2
>
>         Attachments: colt.jar, mahout-165-trove.patch, MAHOUT-165-updated.patch, MAHOUT-165.patch, mahout-165.patch
>
>
> In SparseVector, we need primitives hash map for index and values. The present implementation of this hash map is not as efficient as some of the other implementations in non-Apache projects. 
> In an experiment, I found that, for get/set operations, the primitive hash of  Colt performance an order of magnitude better than OrderedIntDoubleMapping. For iteration it is 2x slower, though. 
> Using Colt in Sparsevector improved performance of canopy generation. For an experimental dataset, the current implementation takes 50 minutes. Using Colt, reduces this duration to 19-20 minutes. That's 60% reduction in the delay. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12779139#action_12779139 ] 

Ted Dunning commented on MAHOUT-165:
------------------------------------

bq. Well, I'm not sure how much of the "making a whole ASF project" overhead is necessary just yet (given how much work goes into a new project), but at least having it live someplace like google-code would give it that option.

Grant,

In light of this, can you clarify your comment about Google Code?

Were you really thinking of forking Colt onto Google code, then making it a TLP in Apache? 

Or did you imagine that forking it onto Google code would be done in order to establish provenance of the code before bringing it under mahout?

Would importing it into Mahout and then budding it out at the right time be a viable alternative to that?  Can a sub-project have a sub-sub-project?  (:-) in case you didn't notice)

Making Colt a TLP or a commons project is an attractive long run idea, but getting it into Mahout now is nicer.

> Using better primitives hash for sparse vector for performance gains
> --------------------------------------------------------------------
>
>                 Key: MAHOUT-165
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-165
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>             Fix For: 0.3
>
>         Attachments: colt.jar, mahout-165-trove.patch, MAHOUT-165-updated.patch, mahout-165.patch, MAHOUT-165.patch, mahout-165.patch
>
>
> In SparseVector, we need primitives hash map for index and values. The present implementation of this hash map is not as efficient as some of the other implementations in non-Apache projects. 
> In an experiment, I found that, for get/set operations, the primitive hash of  Colt performance an order of magnitude better than OrderedIntDoubleMapping. For iteration it is 2x slower, though. 
> Using Colt in Sparsevector improved performance of canopy generation. For an experimental dataset, the current implementation takes 50 minutes. Using Colt, reduces this duration to 19-20 minutes. That's 60% reduction in the delay. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

Posted by "Drew Farris (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12779082#action_12779082 ] 

Drew Farris commented on MAHOUT-165:
------------------------------------

{quote}
Is it ok if I post a patch which has the org.apache.mahout.colt.* code just living inside of core/source/main/java, and then someone else can help out by refactoring my patch to have the right build / maven setup? Any volunteers for that?
{quote}

I know maven well enough to do this. If you post a patch, I'll do the refactoring necessary to get it set up as a module.

> Using better primitives hash for sparse vector for performance gains
> --------------------------------------------------------------------
>
>                 Key: MAHOUT-165
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-165
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>             Fix For: 0.3
>
>         Attachments: colt.jar, mahout-165-trove.patch, MAHOUT-165-updated.patch, mahout-165.patch, MAHOUT-165.patch, mahout-165.patch
>
>
> In SparseVector, we need primitives hash map for index and values. The present implementation of this hash map is not as efficient as some of the other implementations in non-Apache projects. 
> In an experiment, I found that, for get/set operations, the primitive hash of  Colt performance an order of magnitude better than OrderedIntDoubleMapping. For iteration it is 2x slower, though. 
> Using Colt in Sparsevector improved performance of canopy generation. For an experimental dataset, the current implementation takes 50 minutes. Using Colt, reduces this duration to 19-20 minutes. That's 60% reduction in the delay. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

Posted by "Shashikant Kore (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12751398#action_12751398 ] 

Shashikant Kore commented on MAHOUT-165:
----------------------------------------

My interpretation was Trove (and Colt) can't be distributed, but we can copy required classes. Please let me know if I am way off in that interpretation. 

I got the errors after applying the patch. 



> Using better primitives hash for sparse vector for performance gains
> --------------------------------------------------------------------
>
>                 Key: MAHOUT-165
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-165
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>             Fix For: 0.2
>
>         Attachments: mahout-165-trove.patch, mahout-165.patch
>
>
> In SparseVector, we need primitives hash map for index and values. The present implementation of this hash map is not as efficient as some of the other implementations in non-Apache projects. 
> In an experiment, I found that, for get/set operations, the primitive hash of  Colt performance an order of magnitude better than OrderedIntDoubleMapping. For iteration it is 2x slower, though. 
> Using Colt in Sparsevector improved performance of canopy generation. For an experimental dataset, the current implementation takes 50 minutes. Using Colt, reduces this duration to 19-20 minutes. That's 60% reduction in the delay. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12745422#action_12745422 ] 

Sean Owen commented on MAHOUT-165:
----------------------------------

Yes, surely we can simply repackage the .jar to not even include those classes and redistribute only that part which we use and can license.
Or as you say merely reuse the source.

> Using better primitives hash for sparse vector for performance gains
> --------------------------------------------------------------------
>
>                 Key: MAHOUT-165
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-165
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>             Fix For: 0.2
>
>         Attachments: mahout-165.patch
>
>
> In SparseVector, we need primitives hash map for index and values. The present implementation of this hash map is not as efficient as some of the other implementations in non-Apache projects. 
> In an experiment, I found that, for get/set operations, the primitive hash of  Colt performance an order of magnitude better than OrderedIntDoubleMapping. For iteration it is 2x slower, though. 
> Using Colt in Sparsevector improved performance of canopy generation. For an experimental dataset, the current implementation takes 50 minutes. Using Colt, reduces this duration to 19-20 minutes. That's 60% reduction in the delay. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12776999#action_12776999 ] 

Grant Ingersoll commented on MAHOUT-165:
----------------------------------------

Sorry, have been heads down on trying to get a release out.

FWIW, in Ted's code, I suspect if we fix that funky Exception being thrown when it doesn't find an element (meaning it is 0) and return 0, performance will improve.

Otherwise, I'm fine w/ the stripped down version of Colt.  We have interfaces, so it isn't like we have to be married to it.

> Using better primitives hash for sparse vector for performance gains
> --------------------------------------------------------------------
>
>                 Key: MAHOUT-165
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-165
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>             Fix For: 0.3
>
>         Attachments: colt.jar, mahout-165-trove.patch, MAHOUT-165-updated.patch, mahout-165.patch, MAHOUT-165.patch, mahout-165.patch
>
>
> In SparseVector, we need primitives hash map for index and values. The present implementation of this hash map is not as efficient as some of the other implementations in non-Apache projects. 
> In an experiment, I found that, for get/set operations, the primitive hash of  Colt performance an order of magnitude better than OrderedIntDoubleMapping. For iteration it is 2x slower, though. 
> Using Colt in Sparsevector improved performance of canopy generation. For an experimental dataset, the current implementation takes 50 minutes. Using Colt, reduces this duration to 19-20 minutes. That's 60% reduction in the delay. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12751484#action_12751484 ] 

Grant Ingersoll commented on MAHOUT-165:
----------------------------------------

Yes, Sean is correct.  _IF_ the part of Colt we need is cern.colt* packages, as in:
{quote}
Packages cern.colt* , cern.jet*, cern.clhep

    Copyright (c) 1999 CERN - European Organization for Nuclear Research.

    Permission to use, copy, modify, distribute and sell this software and its documentation for any purpose is hereby granted without fee, provided that the above copyright notice appear in all copies and that both that copyright notice and this permission notice appear in supporting documentation. CERN makes no representations about the suitability of this software for any purpose. It is provided "as is" without expressed or implied warranty.
{quote}

Then we can use it.  We _CANNOT_ include anything under the hep.aida package.

> Using better primitives hash for sparse vector for performance gains
> --------------------------------------------------------------------
>
>                 Key: MAHOUT-165
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-165
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>             Fix For: 0.2
>
>         Attachments: mahout-165-trove.patch, mahout-165.patch
>
>
> In SparseVector, we need primitives hash map for index and values. The present implementation of this hash map is not as efficient as some of the other implementations in non-Apache projects. 
> In an experiment, I found that, for get/set operations, the primitive hash of  Colt performance an order of magnitude better than OrderedIntDoubleMapping. For iteration it is 2x slower, though. 
> Using Colt in Sparsevector improved performance of canopy generation. For an experimental dataset, the current implementation takes 50 minutes. Using Colt, reduces this duration to 19-20 minutes. That's 60% reduction in the delay. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781438#action_12781438 ] 

Grant Ingersoll commented on MAHOUT-165:
----------------------------------------

d'oh, missed the correct package names.

> Using better primitives hash for sparse vector for performance gains
> --------------------------------------------------------------------
>
>                 Key: MAHOUT-165
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-165
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>             Fix For: 0.3
>
>         Attachments: colt.jar, mahout-165-18nov-updated.patch, mahout-165-18nov.patch, MAHOUT-165-colt.patch, mahout-165-trove.patch, MAHOUT-165-updated.patch, MAHOUT-165-with-colt-module.patch, MAHOUT-165-with-colt.patch, mahout-165.patch, MAHOUT-165.patch, mahout-165.patch
>
>
> In SparseVector, we need primitives hash map for index and values. The present implementation of this hash map is not as efficient as some of the other implementations in non-Apache projects. 
> In an experiment, I found that, for get/set operations, the primitive hash of  Colt performance an order of magnitude better than OrderedIntDoubleMapping. For iteration it is 2x slower, though. 
> Using Colt in Sparsevector improved performance of canopy generation. For an experimental dataset, the current implementation takes 50 minutes. Using Colt, reduces this duration to 19-20 minutes. That's 60% reduction in the delay. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

Posted by "Shashikant Kore (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shashikant Kore updated MAHOUT-165:
-----------------------------------

    Attachment: mahout-165-18nov.patch

Here is the updated patch.

The dependency on colt-1.2.jar needs to be added to core/pom.xml 

This patch reported following 3 failures.

----------
testAssignUnaryFunction(org.apache.mahout.matrix.TestSparseVector)  Time elapsed: 0.006 sec  <<< FAILURE!
junit.framework.AssertionFailedError: get [0] expected:<0.0> but was:<-0.0>
----------
testDot(org.apache.mahout.matrix.TestSparseVector)  Time elapsed: 0.001 sec  <<< FAILURE!
junit.framework.AssertionFailedError: dot expected:<16.939999999999998> but was:<16.94>
----------
testHashCodeEquivalence(org.apache.mahout.matrix.VectorTest)  Time elapsed: 0.051 sec  <<< FAILURE!
junit.framework.AssertionFailedError: expected:<-1302772127> but was:<2062521697>
----------

First two look innocent. 

The hashcode mismatch occurs because while iterating on the non-zero elements,  sparse vector gives elements in different order than dense vector. We could sort the elements before generating hashcode. Not sure if this is the optimal solution.



> Using better primitives hash for sparse vector for performance gains
> --------------------------------------------------------------------
>
>                 Key: MAHOUT-165
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-165
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>             Fix For: 0.3
>
>         Attachments: colt.jar, mahout-165-18nov.patch, mahout-165-trove.patch, MAHOUT-165-updated.patch, MAHOUT-165-with-colt-module.patch, MAHOUT-165-with-colt.patch, mahout-165.patch, MAHOUT-165.patch, mahout-165.patch
>
>
> In SparseVector, we need primitives hash map for index and values. The present implementation of this hash map is not as efficient as some of the other implementations in non-Apache projects. 
> In an experiment, I found that, for get/set operations, the primitive hash of  Colt performance an order of magnitude better than OrderedIntDoubleMapping. For iteration it is 2x slower, though. 
> Using Colt in Sparsevector improved performance of canopy generation. For an experimental dataset, the current implementation takes 50 minutes. Using Colt, reduces this duration to 19-20 minutes. That's 60% reduction in the delay. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781169#action_12781169 ] 

Sean Owen commented on MAHOUT-165:
----------------------------------

I may not have a very informed perspective on this, but I would expect Vector to live right next to Matrix. I don't see a problem with it living in the matrix library. Yes, it's core -- more core than core actually -- it's in a subset of core that's supposed to be reusable externally.

> Using better primitives hash for sparse vector for performance gains
> --------------------------------------------------------------------
>
>                 Key: MAHOUT-165
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-165
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>             Fix For: 0.3
>
>         Attachments: colt.jar, mahout-165-18nov-updated.patch, mahout-165-18nov.patch, MAHOUT-165-colt.patch, mahout-165-trove.patch, MAHOUT-165-updated.patch, MAHOUT-165-with-colt-module.patch, MAHOUT-165-with-colt.patch, mahout-165.patch, MAHOUT-165.patch, mahout-165.patch
>
>
> In SparseVector, we need primitives hash map for index and values. The present implementation of this hash map is not as efficient as some of the other implementations in non-Apache projects. 
> In an experiment, I found that, for get/set operations, the primitive hash of  Colt performance an order of magnitude better than OrderedIntDoubleMapping. For iteration it is 2x slower, though. 
> Using Colt in Sparsevector improved performance of canopy generation. For an experimental dataset, the current implementation takes 50 minutes. Using Colt, reduces this duration to 19-20 minutes. That's 60% reduction in the delay. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

Posted by "Jake Mannix (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12776945#action_12776945 ] 

Jake Mannix commented on MAHOUT-165:
------------------------------------

Well, I've always had good luck with Colt, but at least Ted seemed to feel that Colt was "no longer state of the art", but maybe he can chime in and elaborate.  

> Using better primitives hash for sparse vector for performance gains
> --------------------------------------------------------------------
>
>                 Key: MAHOUT-165
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-165
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>             Fix For: 0.3
>
>         Attachments: colt.jar, mahout-165-trove.patch, MAHOUT-165-updated.patch, mahout-165.patch, MAHOUT-165.patch, mahout-165.patch
>
>
> In SparseVector, we need primitives hash map for index and values. The present implementation of this hash map is not as efficient as some of the other implementations in non-Apache projects. 
> In an experiment, I found that, for get/set operations, the primitive hash of  Colt performance an order of magnitude better than OrderedIntDoubleMapping. For iteration it is 2x slower, though. 
> Using Colt in Sparsevector improved performance of canopy generation. For an experimental dataset, the current implementation takes 50 minutes. Using Colt, reduces this duration to 19-20 minutes. That's 60% reduction in the delay. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

Posted by "Jake Mannix (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12761002#action_12761002 ] 

Jake Mannix commented on MAHOUT-165:
------------------------------------

Good luck with the "quick" part - there seem to be a lot of changes since you made it, and lots of tests fail even after it gets compiling again!

> Using better primitives hash for sparse vector for performance gains
> --------------------------------------------------------------------
>
>                 Key: MAHOUT-165
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-165
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>             Fix For: 0.2
>
>         Attachments: colt.jar, mahout-165-trove.patch, MAHOUT-165.patch, mahout-165.patch
>
>
> In SparseVector, we need primitives hash map for index and values. The present implementation of this hash map is not as efficient as some of the other implementations in non-Apache projects. 
> In an experiment, I found that, for get/set operations, the primitive hash of  Colt performance an order of magnitude better than OrderedIntDoubleMapping. For iteration it is 2x slower, though. 
> Using Colt in Sparsevector improved performance of canopy generation. For an experimental dataset, the current implementation takes 50 minutes. Using Colt, reduces this duration to 19-20 minutes. That's 60% reduction in the delay. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12748870#action_12748870 ] 

Grant Ingersoll commented on MAHOUT-165:
----------------------------------------

Shashi,

Any thoughts on whether we can just pull out the appropriate pieces?

> Using better primitives hash for sparse vector for performance gains
> --------------------------------------------------------------------
>
>                 Key: MAHOUT-165
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-165
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>             Fix For: 0.2
>
>         Attachments: mahout-165.patch
>
>
> In SparseVector, we need primitives hash map for index and values. The present implementation of this hash map is not as efficient as some of the other implementations in non-Apache projects. 
> In an experiment, I found that, for get/set operations, the primitive hash of  Colt performance an order of magnitude better than OrderedIntDoubleMapping. For iteration it is 2x slower, though. 
> Using Colt in Sparsevector improved performance of canopy generation. For an experimental dataset, the current implementation takes 50 minutes. Using Colt, reduces this duration to 19-20 minutes. That's 60% reduction in the delay. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781451#action_12781451 ] 

Ted Dunning commented on MAHOUT-165:
------------------------------------

bq. From there, refactoring Vector to not have a Writable dependency (etc.) would be great, but let's handle that on a separate issue.

That separate issue should also include a labeling layer as well as the Writable layer.

> Using better primitives hash for sparse vector for performance gains
> --------------------------------------------------------------------
>
>                 Key: MAHOUT-165
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-165
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>             Fix For: 0.3
>
>         Attachments: colt.jar, mahout-165-18nov-updated.patch, mahout-165-18nov.patch, MAHOUT-165-colt.patch, mahout-165-trove.patch, MAHOUT-165-updated.patch, MAHOUT-165-with-colt-module.patch, MAHOUT-165-with-colt.patch, mahout-165.patch, MAHOUT-165.patch, mahout-165.patch
>
>
> In SparseVector, we need primitives hash map for index and values. The present implementation of this hash map is not as efficient as some of the other implementations in non-Apache projects. 
> In an experiment, I found that, for get/set operations, the primitive hash of  Colt performance an order of magnitude better than OrderedIntDoubleMapping. For iteration it is 2x slower, though. 
> Using Colt in Sparsevector improved performance of canopy generation. For an experimental dataset, the current implementation takes 50 minutes. Using Colt, reduces this duration to 19-20 minutes. That's 60% reduction in the delay. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated MAHOUT-165:
-----------------------------

    Fix Version/s:     (was: 0.2)
                   0.3

> Using better primitives hash for sparse vector for performance gains
> --------------------------------------------------------------------
>
>                 Key: MAHOUT-165
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-165
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>             Fix For: 0.3
>
>         Attachments: colt.jar, mahout-165-trove.patch, MAHOUT-165-updated.patch, mahout-165.patch, MAHOUT-165.patch, mahout-165.patch
>
>
> In SparseVector, we need primitives hash map for index and values. The present implementation of this hash map is not as efficient as some of the other implementations in non-Apache projects. 
> In an experiment, I found that, for get/set operations, the primitive hash of  Colt performance an order of magnitude better than OrderedIntDoubleMapping. For iteration it is 2x slower, though. 
> Using Colt in Sparsevector improved performance of canopy generation. For an experimental dataset, the current implementation takes 50 minutes. Using Colt, reduces this duration to 19-20 minutes. That's 60% reduction in the delay. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

Posted by "Jake Mannix (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12776748#action_12776748 ] 

Jake Mannix commented on MAHOUT-165:
------------------------------------

*bump*

So what is the collective vision on this now?  Shashi says that this current impl (from this patch) is pretty slow, but I was thinking that it is actually just some bugs in the IntDoubleMap based impl (mentioned above).  Has anyone looked at that other vector implementation?  Once we have this patch working, we can compare it to colt, and see how they stand up.  

The question remains whether we should consider Commons-Math as our underlying linear package.  We can't really use cmath 2.0, because their api is missing some key features we need (iterators, for one thing, and they only have one sparse vector impl, based on the map, and don't also have an IntDoubleMapping based one, which is pretty key for performance of some sparse algs).  They do have lots of great solvers, but I've been seeing a disturbing number of bugs fixed on their SVD implementation lately (I mean, fixing bugs is great, but having them in there means we don't know how many more there are), and the impls there are translated from fortran, and the code is very dense and hard for me to debug if I see a bug pop up, personally.

The other option is to just get something like MTJ, but they depend on some f2j stuff, which is keeping commons-math from taking them, even though MTJ itself can be appropriately licensed.

Personally, while I hate reinventing wheels, I hate even more having to depend on other libraries which don't quite have the api we need or the functionality we need, when the primitives involved are so core to much of what we want to do.  So I'm more in favor of doing one of two things:  

 * use our own primitives as is, and improve them, remembering that our core competency is on *scaling*, so optimizing for the 1000-element dense double[] vectors and 10k double[][] matrices isn't where we care as much, and most linear libraries aren't in the same mode of thinking.

 * rip the unacceptably licensed parts of Colt out and use the rest for our linear routines.  The hep.aida.* packages are not needed and can be removed without losing the key functionality, and these packages are the only ones which aren't apache compatible.

Thoughts?  We need to settle on at least a wrapper linear API which we'll delegate to if we're going to allow that possibility, but even that involves a bunch of decisions on what should be in that api - not all implementations can do everything, so this form of flexibility comes at a price. 

> Using better primitives hash for sparse vector for performance gains
> --------------------------------------------------------------------
>
>                 Key: MAHOUT-165
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-165
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>             Fix For: 0.3
>
>         Attachments: colt.jar, mahout-165-trove.patch, MAHOUT-165-updated.patch, mahout-165.patch, MAHOUT-165.patch, mahout-165.patch
>
>
> In SparseVector, we need primitives hash map for index and values. The present implementation of this hash map is not as efficient as some of the other implementations in non-Apache projects. 
> In an experiment, I found that, for get/set operations, the primitive hash of  Colt performance an order of magnitude better than OrderedIntDoubleMapping. For iteration it is 2x slower, though. 
> Using Colt in Sparsevector improved performance of canopy generation. For an experimental dataset, the current implementation takes 50 minutes. Using Colt, reduces this duration to 19-20 minutes. That's 60% reduction in the delay. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12779111#action_12779111 ] 

Grant Ingersoll commented on MAHOUT-165:
----------------------------------------

bq. So I found Wolfgang Hoschek, the author of Colt, and he confirms that it is no longer maintained, and wishes us the best of luck in taking it over for ourselves if we so desired.

I seem to recall him being a Lucene contributor in the past.  Perhaps he would be willing to donate Colt to Apache?  I don't think we can just bring in it's source and claim it as ours.  Another option is we see if he would move it over to Google Code and make some of us committers on the project.  Perhaps Commons Math is interested in it, too.



> Using better primitives hash for sparse vector for performance gains
> --------------------------------------------------------------------
>
>                 Key: MAHOUT-165
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-165
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>             Fix For: 0.3
>
>         Attachments: colt.jar, mahout-165-trove.patch, MAHOUT-165-updated.patch, mahout-165.patch, MAHOUT-165.patch, mahout-165.patch
>
>
> In SparseVector, we need primitives hash map for index and values. The present implementation of this hash map is not as efficient as some of the other implementations in non-Apache projects. 
> In an experiment, I found that, for get/set operations, the primitive hash of  Colt performance an order of magnitude better than OrderedIntDoubleMapping. For iteration it is 2x slower, though. 
> Using Colt in Sparsevector improved performance of canopy generation. For an experimental dataset, the current implementation takes 50 minutes. Using Colt, reduces this duration to 19-20 minutes. That's 60% reduction in the delay. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Closed: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

Posted by "Jake Mannix (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jake Mannix closed MAHOUT-165.
------------------------------


> Using better primitives hash for sparse vector for performance gains
> --------------------------------------------------------------------
>
>                 Key: MAHOUT-165
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-165
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>             Fix For: 0.3
>
>         Attachments: colt.jar, mahout-165-18nov-updated.patch, mahout-165-18nov.patch, MAHOUT-165-colt.patch, mahout-165-trove.patch, MAHOUT-165-updated.patch, MAHOUT-165-with-colt-module.patch, MAHOUT-165-with-colt.patch, mahout-165.patch, MAHOUT-165.patch, mahout-165.patch
>
>
> In SparseVector, we need primitives hash map for index and values. The present implementation of this hash map is not as efficient as some of the other implementations in non-Apache projects. 
> In an experiment, I found that, for get/set operations, the primitive hash of  Colt performance an order of magnitude better than OrderedIntDoubleMapping. For iteration it is 2x slower, though. 
> Using Colt in Sparsevector improved performance of canopy generation. For an experimental dataset, the current implementation takes 50 minutes. Using Colt, reduces this duration to 19-20 minutes. That's 60% reduction in the delay. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

Posted by "Shashikant Kore (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shashikant Kore updated MAHOUT-165:
-----------------------------------

    Attachment: mahout-165-trove.patch

For SparseVector, I have copied relevant source from Trove. 

Test case for TestSparseVector failed with following error.

get [0] expected:<0.0> but was:<-0.0>

Also there are other failures while formatting string with Gson :
java.lang.IllegalStateException: Circular reference found: org.apache.mahout.matrix.trove.TIntDoubleHashMap

Pointer to fix these cases will be helpful. 

> Using better primitives hash for sparse vector for performance gains
> --------------------------------------------------------------------
>
>                 Key: MAHOUT-165
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-165
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>             Fix For: 0.2
>
>         Attachments: mahout-165-trove.patch, mahout-165.patch
>
>
> In SparseVector, we need primitives hash map for index and values. The present implementation of this hash map is not as efficient as some of the other implementations in non-Apache projects. 
> In an experiment, I found that, for get/set operations, the primitive hash of  Colt performance an order of magnitude better than OrderedIntDoubleMapping. For iteration it is 2x slower, though. 
> Using Colt in Sparsevector improved performance of canopy generation. For an experimental dataset, the current implementation takes 50 minutes. Using Colt, reduces this duration to 19-20 minutes. That's 60% reduction in the delay. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12751423#action_12751423 ] 

Sean Owen commented on MAHOUT-165:
----------------------------------

While I'm not a lawyer, I am all but certain there is no distinction between distributing a .jar and distributing a class -- in fact distributing source typically carries more restrictions. So, I am pretty sure we can't use Trove if its license is not compatible, in any form.

Colt appears to license its code in two parts, and the part we need is licensed compatibly. To be completely safe with them, we'd need to copy only the part that is suitably licensed. If that means repacking the .jar or copying source or whatever, is up to us.

Others -- how's my interpretation?

> Using better primitives hash for sparse vector for performance gains
> --------------------------------------------------------------------
>
>                 Key: MAHOUT-165
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-165
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>             Fix For: 0.2
>
>         Attachments: mahout-165-trove.patch, mahout-165.patch
>
>
> In SparseVector, we need primitives hash map for index and values. The present implementation of this hash map is not as efficient as some of the other implementations in non-Apache projects. 
> In an experiment, I found that, for get/set operations, the primitive hash of  Colt performance an order of magnitude better than OrderedIntDoubleMapping. For iteration it is 2x slower, though. 
> Using Colt in Sparsevector improved performance of canopy generation. For an experimental dataset, the current implementation takes 50 minutes. Using Colt, reduces this duration to 19-20 minutes. That's 60% reduction in the delay. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

Posted by "Jake Mannix (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jake Mannix updated MAHOUT-165:
-------------------------------

    Attachment: MAHOUT-165-updated.patch

> Using better primitives hash for sparse vector for performance gains
> --------------------------------------------------------------------
>
>                 Key: MAHOUT-165
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-165
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>             Fix For: 0.2
>
>         Attachments: colt.jar, mahout-165-trove.patch, MAHOUT-165-updated.patch, MAHOUT-165.patch, mahout-165.patch
>
>
> In SparseVector, we need primitives hash map for index and values. The present implementation of this hash map is not as efficient as some of the other implementations in non-Apache projects. 
> In an experiment, I found that, for get/set operations, the primitive hash of  Colt performance an order of magnitude better than OrderedIntDoubleMapping. For iteration it is 2x slower, though. 
> Using Colt in Sparsevector improved performance of canopy generation. For an experimental dataset, the current implementation takes 50 minutes. Using Colt, reduces this duration to 19-20 minutes. That's 60% reduction in the delay. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12754585#action_12754585 ] 

Grant Ingersoll commented on MAHOUT-165:
----------------------------------------

Shashi,  did you try Ted's patch?  If that works is sufficient, I'd prefer to use it since then we don't have to introduce another dependency.  I am trying it now.

> Using better primitives hash for sparse vector for performance gains
> --------------------------------------------------------------------
>
>                 Key: MAHOUT-165
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-165
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>             Fix For: 0.2
>
>         Attachments: colt.jar, mahout-165-trove.patch, MAHOUT-165.patch, mahout-165.patch
>
>
> In SparseVector, we need primitives hash map for index and values. The present implementation of this hash map is not as efficient as some of the other implementations in non-Apache projects. 
> In an experiment, I found that, for get/set operations, the primitive hash of  Colt performance an order of magnitude better than OrderedIntDoubleMapping. For iteration it is 2x slower, though. 
> Using Colt in Sparsevector improved performance of canopy generation. For an experimental dataset, the current implementation takes 50 minutes. Using Colt, reduces this duration to 19-20 minutes. That's 60% reduction in the delay. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12778944#action_12778944 ] 

Sean Owen commented on MAHOUT-165:
----------------------------------

I generally favor including stuff as an intact library rather than cracking it open and modifying it, all else equal. Because this is a first step down the road to forking, and I'd hate to fork Colt without good reason.

So I suppose the easiest thing is a .jar file with the unusable classes stripped out. That should be it. We don't need to remove classes that merely depend on hep.aida.*, unless a class we need depends directly or indirectly on a class that references it. Ideally that's not true; we'll see in practice. Even then, just a matter of taking those out too.

We don't necessarily need source and I'd favor not incorporating as source, and not modifying the package names. Yes its dependencies could be changed and updated but I'd say let's not bother until there is a case for doing so. What would we do when Colt is updated, re-port the updates? We'd have a fork on our hands and might as well just decide that's what we're doing.

> Using better primitives hash for sparse vector for performance gains
> --------------------------------------------------------------------
>
>                 Key: MAHOUT-165
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-165
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>             Fix For: 0.3
>
>         Attachments: colt.jar, mahout-165-trove.patch, MAHOUT-165-updated.patch, mahout-165.patch, MAHOUT-165.patch, mahout-165.patch
>
>
> In SparseVector, we need primitives hash map for index and values. The present implementation of this hash map is not as efficient as some of the other implementations in non-Apache projects. 
> In an experiment, I found that, for get/set operations, the primitive hash of  Colt performance an order of magnitude better than OrderedIntDoubleMapping. For iteration it is 2x slower, though. 
> Using Colt in Sparsevector improved performance of canopy generation. For an experimental dataset, the current implementation takes 50 minutes. Using Colt, reduces this duration to 19-20 minutes. That's 60% reduction in the delay. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12760956#action_12760956 ] 

Sean Owen commented on MAHOUT-165:
----------------------------------

Are my conclusions sound then:

We agree that equals() should be 'pretty strict'. The conventional Java wisdom is that equals(), in fact, ought not return true for instances of differing classes, unless you really know what you're doing. I guess we do. :)

If the idea behind equals() is "do class-specific stuff, otherwise, check names, and use equivalent() then", then we don't need strictEquivalence() -- where's it used?

(If I represented the logic correctly above -- is that as simple as we can make it? seems a touch complex)

I am not sure anything is 'broken' in practice here but I sense it could be simpler.


> Using better primitives hash for sparse vector for performance gains
> --------------------------------------------------------------------
>
>                 Key: MAHOUT-165
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-165
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>             Fix For: 0.2
>
>         Attachments: colt.jar, mahout-165-trove.patch, MAHOUT-165.patch, mahout-165.patch
>
>
> In SparseVector, we need primitives hash map for index and values. The present implementation of this hash map is not as efficient as some of the other implementations in non-Apache projects. 
> In an experiment, I found that, for get/set operations, the primitive hash of  Colt performance an order of magnitude better than OrderedIntDoubleMapping. For iteration it is 2x slower, though. 
> Using Colt in Sparsevector improved performance of canopy generation. For an experimental dataset, the current implementation takes 50 minutes. Using Colt, reduces this duration to 19-20 minutes. That's 60% reduction in the delay. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

Posted by "Shashikant Kore (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12748904#action_12748904 ] 

Shashikant Kore commented on MAHOUT-165:
----------------------------------------

I'm fine with copying relevant classes from Colt or Trove. 

Please let me know your library of choice. I will create the patch and upload. 



> Using better primitives hash for sparse vector for performance gains
> --------------------------------------------------------------------
>
>                 Key: MAHOUT-165
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-165
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>             Fix For: 0.2
>
>         Attachments: mahout-165.patch
>
>
> In SparseVector, we need primitives hash map for index and values. The present implementation of this hash map is not as efficient as some of the other implementations in non-Apache projects. 
> In an experiment, I found that, for get/set operations, the primitive hash of  Colt performance an order of magnitude better than OrderedIntDoubleMapping. For iteration it is 2x slower, though. 
> Using Colt in Sparsevector improved performance of canopy generation. For an experimental dataset, the current implementation takes 50 minutes. Using Colt, reduces this duration to 19-20 minutes. That's 60% reduction in the delay. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12765554#action_12765554 ] 

Grant Ingersoll commented on MAHOUT-165:
----------------------------------------

Shashi, can you share your test vectors?  Would be good to have everyone use the same for baseline.

Also, I certainly can create more vectors at: http://people.apache.org/~gsingers/wikipedia/

> Using better primitives hash for sparse vector for performance gains
> --------------------------------------------------------------------
>
>                 Key: MAHOUT-165
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-165
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>             Fix For: 0.2
>
>         Attachments: colt.jar, mahout-165-trove.patch, MAHOUT-165-updated.patch, mahout-165.patch, MAHOUT-165.patch, mahout-165.patch
>
>
> In SparseVector, we need primitives hash map for index and values. The present implementation of this hash map is not as efficient as some of the other implementations in non-Apache projects. 
> In an experiment, I found that, for get/set operations, the primitive hash of  Colt performance an order of magnitude better than OrderedIntDoubleMapping. For iteration it is 2x slower, though. 
> Using Colt in Sparsevector improved performance of canopy generation. For an experimental dataset, the current implementation takes 50 minutes. Using Colt, reduces this duration to 19-20 minutes. That's 60% reduction in the delay. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

Posted by "Shashikant Kore (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shashikant Kore updated MAHOUT-165:
-----------------------------------

    Attachment: colt.jar

Jar for Colt after removing the LGPL code of hep.aida and the the dependent classes. The classes in colt.matrix.*  are removed as they require hep.aida.

This jar will work with my first patch on this issue. 

PS: I am uploading this jar with license to ASF. Not sure what the correct procedure is. 

 

> Using better primitives hash for sparse vector for performance gains
> --------------------------------------------------------------------
>
>                 Key: MAHOUT-165
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-165
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>             Fix For: 0.2
>
>         Attachments: colt.jar, mahout-165-trove.patch, MAHOUT-165.patch, mahout-165.patch
>
>
> In SparseVector, we need primitives hash map for index and values. The present implementation of this hash map is not as efficient as some of the other implementations in non-Apache projects. 
> In an experiment, I found that, for get/set operations, the primitive hash of  Colt performance an order of magnitude better than OrderedIntDoubleMapping. For iteration it is 2x slower, though. 
> Using Colt in Sparsevector improved performance of canopy generation. For an experimental dataset, the current implementation takes 50 minutes. Using Colt, reduces this duration to 19-20 minutes. That's 60% reduction in the delay. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12745064#action_12745064 ] 

Sean Owen commented on MAHOUT-165:
----------------------------------

(Since I wrote that class, I feel compelled to take blame / defend it: it is *not* a hash-based implementation. that's the issue, not the implementation per se. It was merely put in place to replace a comparable non-hash data structure. Now we know a hash is more suitable here, so this is definitely a win to change to an actual hash implementation.)

> Using better primitives hash for sparse vector for performance gains
> --------------------------------------------------------------------
>
>                 Key: MAHOUT-165
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-165
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>             Fix For: 0.2
>
>         Attachments: mahout-165.patch
>
>
> In SparseVector, we need primitives hash map for index and values. The present implementation of this hash map is not as efficient as some of the other implementations in non-Apache projects. 
> In an experiment, I found that, for get/set operations, the primitive hash of  Colt performance an order of magnitude better than OrderedIntDoubleMapping. For iteration it is 2x slower, though. 
> Using Colt in Sparsevector improved performance of canopy generation. For an experimental dataset, the current implementation takes 50 minutes. Using Colt, reduces this duration to 19-20 minutes. That's 60% reduction in the delay. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ted Dunning updated MAHOUT-165:
-------------------------------

    Attachment: MAHOUT-165.patch

Add bespoke implementation of HashVector and change the class structure to allow new sparse vector implementations.

> Using better primitives hash for sparse vector for performance gains
> --------------------------------------------------------------------
>
>                 Key: MAHOUT-165
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-165
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>             Fix For: 0.2
>
>         Attachments: mahout-165-trove.patch, MAHOUT-165.patch, mahout-165.patch
>
>
> In SparseVector, we need primitives hash map for index and values. The present implementation of this hash map is not as efficient as some of the other implementations in non-Apache projects. 
> In an experiment, I found that, for get/set operations, the primitive hash of  Colt performance an order of magnitude better than OrderedIntDoubleMapping. For iteration it is 2x slower, though. 
> Using Colt in Sparsevector improved performance of canopy generation. For an experimental dataset, the current implementation takes 50 minutes. Using Colt, reduces this duration to 19-20 minutes. That's 60% reduction in the delay. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12751385#action_12751385 ] 

Sean Owen commented on MAHOUT-165:
----------------------------------

Wait a sec, I thought we had concluded that we *cannot* use Trove. It was Colt that had a portion which was licensed acceptably.

Are you saying these errors occur before you change? I don't see these failures in head.

The first error -- can't tell you why it happens but can explain it more, if that's what you're asking. Zero and negative zero are actually different doubles, and they aren't ==. Somehow the computation has changed in your patch such that a result ends up zero, but negative zero actually. One might say the test should actually not compare doubles for exact equality, but for equality to the last decimal place or something. But I don't see how this change should have affected this result, period, so probably should be viewed as a problem with the patch or Trove or some funky interaction.

Sounds like Gson can't serialize/deserialize the trove class correctly because of some circular reference among the instances. Dunno why that would be a problem.

But I think all this is moot since we can't use Trove?

> Using better primitives hash for sparse vector for performance gains
> --------------------------------------------------------------------
>
>                 Key: MAHOUT-165
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-165
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>             Fix For: 0.2
>
>         Attachments: mahout-165-trove.patch, mahout-165.patch
>
>
> In SparseVector, we need primitives hash map for index and values. The present implementation of this hash map is not as efficient as some of the other implementations in non-Apache projects. 
> In an experiment, I found that, for get/set operations, the primitive hash of  Colt performance an order of magnitude better than OrderedIntDoubleMapping. For iteration it is 2x slower, though. 
> Using Colt in Sparsevector improved performance of canopy generation. For an experimental dataset, the current implementation takes 50 minutes. Using Colt, reduces this duration to 19-20 minutes. That's 60% reduction in the delay. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

Posted by "Shashikant Kore (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shashikant Kore updated MAHOUT-165:
-----------------------------------

    Attachment: mahout-165.patch

Patch for using Colt in SparseVector

> Using better primitives hash for sparse vector for performance gains
> --------------------------------------------------------------------
>
>                 Key: MAHOUT-165
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-165
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>             Fix For: 0.2
>
>         Attachments: mahout-165.patch
>
>
> In SparseVector, we need primitives hash map for index and values. The present implementation of this hash map is not as efficient as some of the other implementations in non-Apache projects. 
> In an experiment, I found that, for get/set operations, the primitive hash of  Colt performance an order of magnitude better than OrderedIntDoubleMapping. For iteration it is 2x slower, though. 
> Using Colt in Sparsevector improved performance of canopy generation. For an experimental dataset, the current implementation takes 50 minutes. Using Colt, reduces this duration to 19-20 minutes. That's 60% reduction in the delay. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

Posted by "Jake Mannix (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12779068#action_12779068 ] 

Jake Mannix commented on MAHOUT-165:
------------------------------------

bq. Should all code be marked deprecated until it has unit tests with a comment to say why it is deprecated? That would give us a clear visual signal that the lifeguard has left the pool.

+1

Sounds like a good plan.

So if there's going to be another submodule, for colt, I'm not sure if I'm qualified to do that - I'm not a maven expert, so writing a new pom I could (and should, and will!) learn how to do, but it would slow down this process.

Is it ok if I post a patch which has the org.apache.mahout.colt.* code just living inside of core/source/main/java, and then someone else can help out by refactoring my patch to have the right build / maven setup?  Any volunteers for that?

> Using better primitives hash for sparse vector for performance gains
> --------------------------------------------------------------------
>
>                 Key: MAHOUT-165
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-165
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>             Fix For: 0.3
>
>         Attachments: colt.jar, mahout-165-trove.patch, MAHOUT-165-updated.patch, mahout-165.patch, MAHOUT-165.patch, mahout-165.patch
>
>
> In SparseVector, we need primitives hash map for index and values. The present implementation of this hash map is not as efficient as some of the other implementations in non-Apache projects. 
> In an experiment, I found that, for get/set operations, the primitive hash of  Colt performance an order of magnitude better than OrderedIntDoubleMapping. For iteration it is 2x slower, though. 
> Using Colt in Sparsevector improved performance of canopy generation. For an experimental dataset, the current implementation takes 50 minutes. Using Colt, reduces this duration to 19-20 minutes. That's 60% reduction in the delay. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

Posted by "Jake Mannix (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jake Mannix updated MAHOUT-165:
-------------------------------

    Attachment: MAHOUT-165-with-colt.patch

While there isn't yet a consensus on where this stuff will live, for now, so we can see what we're getting into, I've attached a patch which includes, currently in core/src/main/java, two new packages: org.apache.mahout.colt and org.apache.mahout.jet.

The only modifications I've done to these files is removing any dependency on hep.aida.*, the LGPL'ed corejava.*, and cern.clhep physical constants stuff, and I also, ironically enough, removed an ugly "©" symbol which at least on my system was causing javac to blow up during compile (I left the rest of the entire copyright notice intact, just without the © symbol in there). 

I added:

{code}
/**
 * @deprecated until unit tests are in place.  Until this time, this class/interface is unsupported.
 */
@Deprecated
{code}

in front of every public class or interface.

Also, to get it to compile, I added a maven dependency on Doug Lea's concurrent jar, which it turns out is accessible in apache's maven repo.

If we really want this out in google-code somewhere, I can easily package up this patch and put it somewhere else.

> Using better primitives hash for sparse vector for performance gains
> --------------------------------------------------------------------
>
>                 Key: MAHOUT-165
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-165
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>             Fix For: 0.3
>
>         Attachments: colt.jar, mahout-165-trove.patch, MAHOUT-165-updated.patch, MAHOUT-165-with-colt.patch, mahout-165.patch, MAHOUT-165.patch, mahout-165.patch
>
>
> In SparseVector, we need primitives hash map for index and values. The present implementation of this hash map is not as efficient as some of the other implementations in non-Apache projects. 
> In an experiment, I found that, for get/set operations, the primitive hash of  Colt performance an order of magnitude better than OrderedIntDoubleMapping. For iteration it is 2x slower, though. 
> Using Colt in Sparsevector improved performance of canopy generation. For an experimental dataset, the current implementation takes 50 minutes. Using Colt, reduces this duration to 19-20 minutes. That's 60% reduction in the delay. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

Posted by "Jake Mannix (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12779033#action_12779033 ] 

Jake Mannix commented on MAHOUT-165:
------------------------------------

So I found Wolfgang Hoschek, the author of Colt, and he confirms that it is no longer maintained, and wishes us the best of luck in taking it over for ourselves if we so desired.

If we transplant it (I'd rather call it a transplant than a fork, if the original trunk of the tree is dead), what's the procedure?  

  * Build a jar, put it in the apache maven repository?  
  * Include all allowed source (inside of core/source/main/java?) with original package names and no changes other than removing the hep.aida.* classes?
  * something else?



> Using better primitives hash for sparse vector for performance gains
> --------------------------------------------------------------------
>
>                 Key: MAHOUT-165
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-165
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>             Fix For: 0.3
>
>         Attachments: colt.jar, mahout-165-trove.patch, MAHOUT-165-updated.patch, mahout-165.patch, MAHOUT-165.patch, mahout-165.patch
>
>
> In SparseVector, we need primitives hash map for index and values. The present implementation of this hash map is not as efficient as some of the other implementations in non-Apache projects. 
> In an experiment, I found that, for get/set operations, the primitive hash of  Colt performance an order of magnitude better than OrderedIntDoubleMapping. For iteration it is 2x slower, though. 
> Using Colt in Sparsevector improved performance of canopy generation. For an experimental dataset, the current implementation takes 50 minutes. Using Colt, reduces this duration to 19-20 minutes. That's 60% reduction in the delay. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

Posted by "Jake Mannix (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781165#action_12781165 ] 

Jake Mannix commented on MAHOUT-165:
------------------------------------

bq. +1 yes the order of dependency makes the most sense; also makes it easy for others to reuse who are interested in only the matrix library.

+1 from me as well.  This also makes it easier to keep decomposer decoupled - it can depend on mahout-matrix, and not need mahout-core.

But what about Vector?  Vector coming out of Core and moving into another module, well, everything depends on Vector, doesn't it?  In what way does this help?

> Using better primitives hash for sparse vector for performance gains
> --------------------------------------------------------------------
>
>                 Key: MAHOUT-165
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-165
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>             Fix For: 0.3
>
>         Attachments: colt.jar, mahout-165-18nov-updated.patch, mahout-165-18nov.patch, MAHOUT-165-colt.patch, mahout-165-trove.patch, MAHOUT-165-updated.patch, MAHOUT-165-with-colt-module.patch, MAHOUT-165-with-colt.patch, mahout-165.patch, MAHOUT-165.patch, mahout-165.patch
>
>
> In SparseVector, we need primitives hash map for index and values. The present implementation of this hash map is not as efficient as some of the other implementations in non-Apache projects. 
> In an experiment, I found that, for get/set operations, the primitive hash of  Colt performance an order of magnitude better than OrderedIntDoubleMapping. For iteration it is 2x slower, though. 
> Using Colt in Sparsevector improved performance of canopy generation. For an experimental dataset, the current implementation takes 50 minutes. Using Colt, reduces this duration to 19-20 minutes. That's 60% reduction in the delay. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

Posted by "Jake Mannix (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12779398#action_12779398 ] 

Jake Mannix commented on MAHOUT-165:
------------------------------------

Shashi, is this patch just an update of the one which Drew made earlier?  I wasn't seeing any test failures when I ran them earlier on that patch...

Or is your patch the first to actually use the newly-included colt code?

> Using better primitives hash for sparse vector for performance gains
> --------------------------------------------------------------------
>
>                 Key: MAHOUT-165
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-165
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>             Fix For: 0.3
>
>         Attachments: colt.jar, mahout-165-18nov.patch, mahout-165-trove.patch, MAHOUT-165-updated.patch, MAHOUT-165-with-colt-module.patch, MAHOUT-165-with-colt.patch, mahout-165.patch, MAHOUT-165.patch, mahout-165.patch
>
>
> In SparseVector, we need primitives hash map for index and values. The present implementation of this hash map is not as efficient as some of the other implementations in non-Apache projects. 
> In an experiment, I found that, for get/set operations, the primitive hash of  Colt performance an order of magnitude better than OrderedIntDoubleMapping. For iteration it is 2x slower, though. 
> Using Colt in Sparsevector improved performance of canopy generation. For an experimental dataset, the current implementation takes 50 minutes. Using Colt, reduces this duration to 19-20 minutes. That's 60% reduction in the delay. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12745415#action_12745415 ] 

Grant Ingersoll commented on MAHOUT-165:
----------------------------------------

http://acs.lbl.gov/~hoschek/colt-download/releases/license.html 

The clause for military use in HEP is unacceptable, unfortunately.  There is no way we could enforce that.  If we could use Colt w/o HEP, then we could use Colt.  Perhaps there is just a class or two from Colt that is ASL licensed that we can copy over and properly attribute?  

> Using better primitives hash for sparse vector for performance gains
> --------------------------------------------------------------------
>
>                 Key: MAHOUT-165
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-165
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>             Fix For: 0.2
>
>         Attachments: mahout-165.patch
>
>
> In SparseVector, we need primitives hash map for index and values. The present implementation of this hash map is not as efficient as some of the other implementations in non-Apache projects. 
> In an experiment, I found that, for get/set operations, the primitive hash of  Colt performance an order of magnitude better than OrderedIntDoubleMapping. For iteration it is 2x slower, though. 
> Using Colt in Sparsevector improved performance of canopy generation. For an experimental dataset, the current implementation takes 50 minutes. Using Colt, reduces this duration to 19-20 minutes. That's 60% reduction in the delay. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Assigned: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll reassigned MAHOUT-165:
--------------------------------------

    Assignee: Grant Ingersoll

> Using better primitives hash for sparse vector for performance gains
> --------------------------------------------------------------------
>
>                 Key: MAHOUT-165
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-165
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>             Fix For: 0.2
>
>         Attachments: colt.jar, mahout-165-trove.patch, MAHOUT-165.patch, mahout-165.patch
>
>
> In SparseVector, we need primitives hash map for index and values. The present implementation of this hash map is not as efficient as some of the other implementations in non-Apache projects. 
> In an experiment, I found that, for get/set operations, the primitive hash of  Colt performance an order of magnitude better than OrderedIntDoubleMapping. For iteration it is 2x slower, though. 
> Using Colt in Sparsevector improved performance of canopy generation. For an experimental dataset, the current implementation takes 50 minutes. Using Colt, reduces this duration to 19-20 minutes. That's 60% reduction in the delay. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12745178#action_12745178 ] 

Sean Owen commented on MAHOUT-165:
----------------------------------

Does Colt have license issues? It is all but public domain. It contains only a notice license.

I'd support Someone Else looking at rewriting everything to use another matrix library. Doesnt totally make sense to have our own and the abstractions are already a little creaky.

> Using better primitives hash for sparse vector for performance gains
> --------------------------------------------------------------------
>
>                 Key: MAHOUT-165
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-165
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>             Fix For: 0.2
>
>         Attachments: mahout-165.patch
>
>
> In SparseVector, we need primitives hash map for index and values. The present implementation of this hash map is not as efficient as some of the other implementations in non-Apache projects. 
> In an experiment, I found that, for get/set operations, the primitive hash of  Colt performance an order of magnitude better than OrderedIntDoubleMapping. For iteration it is 2x slower, though. 
> Using Colt in Sparsevector improved performance of canopy generation. For an experimental dataset, the current implementation takes 50 minutes. Using Colt, reduces this duration to 19-20 minutes. That's 60% reduction in the delay. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12779046#action_12779046 ] 

Sean Owen commented on MAHOUT-165:
----------------------------------

+1 to a transplant, and rename to org.apache.mahout.colt
+1 to the plan to write tests as we go

I'm surely up for the task of straightening up the code to match the project, stuff like updating for java.util.concurrent, etc.

> Using better primitives hash for sparse vector for performance gains
> --------------------------------------------------------------------
>
>                 Key: MAHOUT-165
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-165
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>             Fix For: 0.3
>
>         Attachments: colt.jar, mahout-165-trove.patch, MAHOUT-165-updated.patch, mahout-165.patch, MAHOUT-165.patch, mahout-165.patch
>
>
> In SparseVector, we need primitives hash map for index and values. The present implementation of this hash map is not as efficient as some of the other implementations in non-Apache projects. 
> In an experiment, I found that, for get/set operations, the primitive hash of  Colt performance an order of magnitude better than OrderedIntDoubleMapping. For iteration it is 2x slower, though. 
> Using Colt in Sparsevector improved performance of canopy generation. For an experimental dataset, the current implementation takes 50 minutes. Using Colt, reduces this duration to 19-20 minutes. That's 60% reduction in the delay. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Issue Comment Edited: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

Posted by "Drew Farris (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12779082#action_12779082 ] 

Drew Farris edited comment on MAHOUT-165 at 11/17/09 7:53 PM:
--------------------------------------------------------------

{quote}
Is it ok if I post a patch which has the org.apache.mahout.colt.* code just living inside of core/source/main/java, and then someone else can help out by refactoring my patch to have the right build / maven setup? Any volunteers for that?
{quote}

I know maven well enough to do this. If you post a patch, I'll do the refactoring necessary to get it set up as a module.

{quote}
Should all code be marked deprecated until it has unit tests with a comment to say why it is deprecated? That would give us a clear visual signal that the lifeguard has left the pool.
{quote}

+1 sounds like a good idea to me. It's just a matter of adding "@deprecated no unit tests.." tag to each class, no?


      was (Author: drew.farris):
    {quote}
Is it ok if I post a patch which has the org.apache.mahout.colt.* code just living inside of core/source/main/java, and then someone else can help out by refactoring my patch to have the right build / maven setup? Any volunteers for that?
{quote}

I know maven well enough to do this. If you post a patch, I'll do the refactoring necessary to get it set up as a module.
  
> Using better primitives hash for sparse vector for performance gains
> --------------------------------------------------------------------
>
>                 Key: MAHOUT-165
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-165
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>             Fix For: 0.3
>
>         Attachments: colt.jar, mahout-165-trove.patch, MAHOUT-165-updated.patch, mahout-165.patch, MAHOUT-165.patch, mahout-165.patch
>
>
> In SparseVector, we need primitives hash map for index and values. The present implementation of this hash map is not as efficient as some of the other implementations in non-Apache projects. 
> In an experiment, I found that, for get/set operations, the primitive hash of  Colt performance an order of magnitude better than OrderedIntDoubleMapping. For iteration it is 2x slower, though. 
> Using Colt in Sparsevector improved performance of canopy generation. For an experimental dataset, the current implementation takes 50 minutes. Using Colt, reduces this duration to 19-20 minutes. That's 60% reduction in the delay. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

Posted by "Shashikant Kore (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12751520#action_12751520 ] 

Shashikant Kore commented on MAHOUT-165:
----------------------------------------

OK.  Should I copy relevant classes source from Colt or we can distribute the binary? 

BTW, Trove is LGPL.


> Using better primitives hash for sparse vector for performance gains
> --------------------------------------------------------------------
>
>                 Key: MAHOUT-165
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-165
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>             Fix For: 0.2
>
>         Attachments: mahout-165-trove.patch, mahout-165.patch
>
>
> In SparseVector, we need primitives hash map for index and values. The present implementation of this hash map is not as efficient as some of the other implementations in non-Apache projects. 
> In an experiment, I found that, for get/set operations, the primitive hash of  Colt performance an order of magnitude better than OrderedIntDoubleMapping. For iteration it is 2x slower, though. 
> Using Colt in Sparsevector improved performance of canopy generation. For an experimental dataset, the current implementation takes 50 minutes. Using Colt, reduces this duration to 19-20 minutes. That's 60% reduction in the delay. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12745417#action_12745417 ] 

Grant Ingersoll commented on MAHOUT-165:
----------------------------------------

Also, perhaps there is a snapshot available for MTJ in Commons Math?

> Using better primitives hash for sparse vector for performance gains
> --------------------------------------------------------------------
>
>                 Key: MAHOUT-165
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-165
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>             Fix For: 0.2
>
>         Attachments: mahout-165.patch
>
>
> In SparseVector, we need primitives hash map for index and values. The present implementation of this hash map is not as efficient as some of the other implementations in non-Apache projects. 
> In an experiment, I found that, for get/set operations, the primitive hash of  Colt performance an order of magnitude better than OrderedIntDoubleMapping. For iteration it is 2x slower, though. 
> Using Colt in Sparsevector improved performance of canopy generation. For an experimental dataset, the current implementation takes 50 minutes. Using Colt, reduces this duration to 19-20 minutes. That's 60% reduction in the delay. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

Posted by "Shashikant Kore (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12745409#action_12745409 ] 

Shashikant Kore commented on MAHOUT-165:
----------------------------------------

http://acs.lbl.gov/~hoschek/colt/dependencies.html

This says Colt's license is Apache style. 

It has some dependency on LGPL jar, but we don't need it. 


> Using better primitives hash for sparse vector for performance gains
> --------------------------------------------------------------------
>
>                 Key: MAHOUT-165
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-165
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>             Fix For: 0.2
>
>         Attachments: mahout-165.patch
>
>
> In SparseVector, we need primitives hash map for index and values. The present implementation of this hash map is not as efficient as some of the other implementations in non-Apache projects. 
> In an experiment, I found that, for get/set operations, the primitive hash of  Colt performance an order of magnitude better than OrderedIntDoubleMapping. For iteration it is 2x slower, though. 
> Using Colt in Sparsevector improved performance of canopy generation. For an experimental dataset, the current implementation takes 50 minutes. Using Colt, reduces this duration to 19-20 minutes. That's 60% reduction in the delay. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

Posted by "Jake Mannix (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12779121#action_12779121 ] 

Jake Mannix commented on MAHOUT-165:
------------------------------------

bq. Perhaps he would be willing to donate Colt to Apache? I don't think we can just bring in it's source and claim it as ours. Another option is we see if he would move it over to Google Code and make some of us committers on the project. Perhaps Commons Math is interested in it, too.

I asked him, and he said that we can go ahead and have it, he's not maintaining it anymore.  It's Apache-licensed, so can't we take it, regardless of whether he was contactable or not, as long as we attribute him and abide by the license he put on the code:

    Packages cern.colt* , cern.jet*, cern.clhep

    Copyright (c) 1999 CERN - European Organization for Nuclear Research.

    Permission to use, copy, modify, distribute and sell this software and its documentation for any purpose is hereby granted without fee, provided that the above copyright notice appear in all copies and that both that copyright notice and this permission notice appear in supporting documentation. CERN makes no representations about the suitability of this software for any purpose. It is provided "as is" without expressed or implied warranty.

But either way, he did say we could have it.

Commons math was interested at some point, as well as being interested in MTJ, but they seem to have abandoned the idea of incorporating anyone else's primitives anytime soon, as they are not willing to break their backwards compatibility reqs until 3.0 (and since 2.0 just came out a few months ago, we're talking a looooong time).

Putting it on google-code is an interesting option, it would resurrect the project, from a perspective of people outside of Mahout... I kinda like that idea.  Would it slow down our ability to use it, to do this?

> Using better primitives hash for sparse vector for performance gains
> --------------------------------------------------------------------
>
>                 Key: MAHOUT-165
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-165
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>             Fix For: 0.3
>
>         Attachments: colt.jar, mahout-165-trove.patch, MAHOUT-165-updated.patch, mahout-165.patch, MAHOUT-165.patch, mahout-165.patch
>
>
> In SparseVector, we need primitives hash map for index and values. The present implementation of this hash map is not as efficient as some of the other implementations in non-Apache projects. 
> In an experiment, I found that, for get/set operations, the primitive hash of  Colt performance an order of magnitude better than OrderedIntDoubleMapping. For iteration it is 2x slower, though. 
> Using Colt in Sparsevector improved performance of canopy generation. For an experimental dataset, the current implementation takes 50 minutes. Using Colt, reduces this duration to 19-20 minutes. That's 60% reduction in the delay. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12779127#action_12779127 ] 

Sean Owen commented on MAHOUT-165:
----------------------------------

Regarding @deprecated: sounds a little aggressive to me, but not against it.

I also like the idea of getting it donated to Apache, since it is cleanest and consistent with the intent here -- to move it forward in an Apache project.

But failing that, its license still allows us to include, modify, distribute the code. We'd have to still note its copyright and license, of course. So it's not a blocker.

> Using better primitives hash for sparse vector for performance gains
> --------------------------------------------------------------------
>
>                 Key: MAHOUT-165
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-165
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>             Fix For: 0.3
>
>         Attachments: colt.jar, mahout-165-trove.patch, MAHOUT-165-updated.patch, mahout-165.patch, MAHOUT-165.patch, mahout-165.patch
>
>
> In SparseVector, we need primitives hash map for index and values. The present implementation of this hash map is not as efficient as some of the other implementations in non-Apache projects. 
> In an experiment, I found that, for get/set operations, the primitive hash of  Colt performance an order of magnitude better than OrderedIntDoubleMapping. For iteration it is 2x slower, though. 
> Using Colt in Sparsevector improved performance of canopy generation. For an experimental dataset, the current implementation takes 50 minutes. Using Colt, reduces this duration to 19-20 minutes. That's 60% reduction in the delay. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

Posted by "Jake Mannix (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12778937#action_12778937 ] 

Jake Mannix commented on MAHOUT-165:
------------------------------------

If we're going to try out a patch which includes Colt, we really need the Colt source, properly cleaned of offending material, not just a jar, right?  

Similarly, as I mentioned above, cern.colt.matrix.* classes are pretty important, and can be included if a little care is made in pulling out the hep.aida.* dependencies.  Shashi, your colt.jar doesn't have the cern.colt.matrix included, do they?

If I post a patch with the entire (cleaned) source tree of colt, can we apply it?  What is the procedure for doing this kind of thing?  Are we keeping the package hierarchy intact, or should we do a swap of cern.colt to org.mahout.colt?  If this kind of thing is done, we'll want it to live in it's own maven sub-module in here, I would imagine.

On the topic of dependencies, colt internally has a dependency on Doug Lea's original edu.oswego concurrent library, which is public domain, so that's ok, but should be upgraded to java.util.concurrent.  Unfortunately, not all classes in edu.oswego.concurrent have counterparts in java.util.concurrent yet: the fork/join framework doesn't make it itno core java until 1.7, and is used inside of colt... so there's a dependency on concurrent.jar... does the apache maven repo have concurrent 1.3.4 in it?  ibiblio does appears to...

Sorry to make things complicated - colt has a lot more than just a SparseVector implementation, so if we're going to include it, we should make sure to get the benefit of all it has to offer.

> Using better primitives hash for sparse vector for performance gains
> --------------------------------------------------------------------
>
>                 Key: MAHOUT-165
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-165
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>             Fix For: 0.3
>
>         Attachments: colt.jar, mahout-165-trove.patch, MAHOUT-165-updated.patch, mahout-165.patch, MAHOUT-165.patch, mahout-165.patch
>
>
> In SparseVector, we need primitives hash map for index and values. The present implementation of this hash map is not as efficient as some of the other implementations in non-Apache projects. 
> In an experiment, I found that, for get/set operations, the primitive hash of  Colt performance an order of magnitude better than OrderedIntDoubleMapping. For iteration it is 2x slower, though. 
> Using Colt in Sparsevector improved performance of canopy generation. For an experimental dataset, the current implementation takes 50 minutes. Using Colt, reduces this duration to 19-20 minutes. That's 60% reduction in the delay. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12778973#action_12778973 ] 

Sean Owen commented on MAHOUT-165:
----------------------------------

Oh I see. Well then, sounds like a compelling reason to 'fork'. FWIW I say go for it.

> Using better primitives hash for sparse vector for performance gains
> --------------------------------------------------------------------
>
>                 Key: MAHOUT-165
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-165
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>             Fix For: 0.3
>
>         Attachments: colt.jar, mahout-165-trove.patch, MAHOUT-165-updated.patch, mahout-165.patch, MAHOUT-165.patch, mahout-165.patch
>
>
> In SparseVector, we need primitives hash map for index and values. The present implementation of this hash map is not as efficient as some of the other implementations in non-Apache projects. 
> In an experiment, I found that, for get/set operations, the primitive hash of  Colt performance an order of magnitude better than OrderedIntDoubleMapping. For iteration it is 2x slower, though. 
> Using Colt in Sparsevector improved performance of canopy generation. For an experimental dataset, the current implementation takes 50 minutes. Using Colt, reduces this duration to 19-20 minutes. That's 60% reduction in the delay. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12779650#action_12779650 ] 

Ted Dunning commented on MAHOUT-165:
------------------------------------


bq. Then my comment is just that the hash code can be defined such that it doesn't need elements in order to be computed, so shouldn't.

java.util.AbstractMap seems to think that the sum of the hash codes of all entries is good enough.  And that is good enough for me.  Using a commutative operator gets rid of the need for ordering.

I would propose basically the same thing:

hashcode(Matrix) = sum_row hashcode(row)

hashcode(Vector) = sum_(i,v_i) hashcode(i) + hashcode(v_i)

for hashing integers, the integer itself is fine.  For doubles, doubleToLongBits provides what we need (just xor the two halves of the resulting long).
           

> Using better primitives hash for sparse vector for performance gains
> --------------------------------------------------------------------
>
>                 Key: MAHOUT-165
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-165
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>             Fix For: 0.3
>
>         Attachments: colt.jar, mahout-165-18nov-updated.patch, mahout-165-18nov.patch, mahout-165-trove.patch, MAHOUT-165-updated.patch, MAHOUT-165-with-colt-module.patch, MAHOUT-165-with-colt.patch, mahout-165.patch, MAHOUT-165.patch, mahout-165.patch
>
>
> In SparseVector, we need primitives hash map for index and values. The present implementation of this hash map is not as efficient as some of the other implementations in non-Apache projects. 
> In an experiment, I found that, for get/set operations, the primitive hash of  Colt performance an order of magnitude better than OrderedIntDoubleMapping. For iteration it is 2x slower, though. 
> Using Colt in Sparsevector improved performance of canopy generation. For an experimental dataset, the current implementation takes 50 minutes. Using Colt, reduces this duration to 19-20 minutes. That's 60% reduction in the delay. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Issue Comment Edited: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12745415#action_12745415 ] 

Grant Ingersoll edited comment on MAHOUT-165 at 8/20/09 5:05 AM:
-----------------------------------------------------------------

http://acs.lbl.gov/~hoschek/colt-download/releases/license.html 

The clause for military use in HEP is unacceptable, unfortunately.  There is no way we could enforce that.  If we could use Colt w/o HEP, then we could use Colt.  Perhaps there is just a class or two from Colt that is ASL licensed that we can copy over and properly attribute and that doesn't have any dependencies on HEP?  

      was (Author: gsingers):
    http://acs.lbl.gov/~hoschek/colt-download/releases/license.html 

The clause for military use in HEP is unacceptable, unfortunately.  There is no way we could enforce that.  If we could use Colt w/o HEP, then we could use Colt.  Perhaps there is just a class or two from Colt that is ASL licensed that we can copy over and properly attribute?  
  
> Using better primitives hash for sparse vector for performance gains
> --------------------------------------------------------------------
>
>                 Key: MAHOUT-165
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-165
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>             Fix For: 0.2
>
>         Attachments: mahout-165.patch
>
>
> In SparseVector, we need primitives hash map for index and values. The present implementation of this hash map is not as efficient as some of the other implementations in non-Apache projects. 
> In an experiment, I found that, for get/set operations, the primitive hash of  Colt performance an order of magnitude better than OrderedIntDoubleMapping. For iteration it is 2x slower, though. 
> Using Colt in Sparsevector improved performance of canopy generation. For an experimental dataset, the current implementation takes 50 minutes. Using Colt, reduces this duration to 19-20 minutes. That's 60% reduction in the delay. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

Posted by "Drew Farris (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Drew Farris updated MAHOUT-165:
-------------------------------

    Attachment: MAHOUT-165-with-colt-module.patch

As discussed earlier I've taken Jake's patch and moved all of the colt source into a new module under mahout.

After this patch is applied, colt can be built  from the top level package adding -Pcolt to the maven invocation, e.g: {{mvn clean install -Pcolt}}. The result is {{mahout-colt-0.3-SNAPSHOT.jar}}, which in turn could be added as a dependency in core, but I'll leave that another patch once there's stuff in core that actually uses colt.

I've removed the concurrent dependency from the core pom added by Jake's patch. Other than that and the relocation everything else is the same as waht Jake provided.




> Using better primitives hash for sparse vector for performance gains
> --------------------------------------------------------------------
>
>                 Key: MAHOUT-165
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-165
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>             Fix For: 0.3
>
>         Attachments: colt.jar, mahout-165-trove.patch, MAHOUT-165-updated.patch, MAHOUT-165-with-colt-module.patch, MAHOUT-165-with-colt.patch, mahout-165.patch, MAHOUT-165.patch, mahout-165.patch
>
>
> In SparseVector, we need primitives hash map for index and values. The present implementation of this hash map is not as efficient as some of the other implementations in non-Apache projects. 
> In an experiment, I found that, for get/set operations, the primitive hash of  Colt performance an order of magnitude better than OrderedIntDoubleMapping. For iteration it is 2x slower, though. 
> Using Colt in Sparsevector improved performance of canopy generation. For an experimental dataset, the current implementation takes 50 minutes. Using Colt, reduces this duration to 19-20 minutes. That's 60% reduction in the delay. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12777050#action_12777050 ] 

Ted Dunning commented on MAHOUT-165:
------------------------------------


My issues (which I used for quite some time) were probably either remediable or irrelevant.

The remediable problem was that the API was opaque for new-comers and very difficult to extend with new matrix implementations.  If we take Colt as a starting point and fix some of the extension and opacity issues, then this problem goes away.

My second issue is that more modern libraries like MTJ can achieve about 4x the raw performance of Colt.  As Grant rightly points out, that probably doesn't matter to us right away since the goal here is scaling rather than raw hot-iron performance on a single box.  Moreover, as Grant also points out, we will have a pluggable interface which should allow us to switch if the commons math guys ever come around.



> Using better primitives hash for sparse vector for performance gains
> --------------------------------------------------------------------
>
>                 Key: MAHOUT-165
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-165
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>             Fix For: 0.3
>
>         Attachments: colt.jar, mahout-165-trove.patch, MAHOUT-165-updated.patch, mahout-165.patch, MAHOUT-165.patch, mahout-165.patch
>
>
> In SparseVector, we need primitives hash map for index and values. The present implementation of this hash map is not as efficient as some of the other implementations in non-Apache projects. 
> In an experiment, I found that, for get/set operations, the primitive hash of  Colt performance an order of magnitude better than OrderedIntDoubleMapping. For iteration it is 2x slower, though. 
> Using Colt in Sparsevector improved performance of canopy generation. For an experimental dataset, the current implementation takes 50 minutes. Using Colt, reduces this duration to 19-20 minutes. That's 60% reduction in the delay. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12779456#action_12779456 ] 

Grant Ingersoll commented on MAHOUT-165:
----------------------------------------

All sounding pretty good.  If you don't mind, I'd like to review the legal bits before committing.  I have a deadline on Thursday, but can get to it after that.

> Using better primitives hash for sparse vector for performance gains
> --------------------------------------------------------------------
>
>                 Key: MAHOUT-165
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-165
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>             Fix For: 0.3
>
>         Attachments: colt.jar, mahout-165-18nov-updated.patch, mahout-165-18nov.patch, mahout-165-trove.patch, MAHOUT-165-updated.patch, MAHOUT-165-with-colt-module.patch, MAHOUT-165-with-colt.patch, mahout-165.patch, MAHOUT-165.patch, mahout-165.patch
>
>
> In SparseVector, we need primitives hash map for index and values. The present implementation of this hash map is not as efficient as some of the other implementations in non-Apache projects. 
> In an experiment, I found that, for get/set operations, the primitive hash of  Colt performance an order of magnitude better than OrderedIntDoubleMapping. For iteration it is 2x slower, though. 
> Using Colt in Sparsevector improved performance of canopy generation. For an experimental dataset, the current implementation takes 50 minutes. Using Colt, reduces this duration to 19-20 minutes. That's 60% reduction in the delay. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12778832#action_12778832 ] 

Grant Ingersoll commented on MAHOUT-165:
----------------------------------------

Shashi, can you make sure the patch is up to date?

> Using better primitives hash for sparse vector for performance gains
> --------------------------------------------------------------------
>
>                 Key: MAHOUT-165
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-165
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>             Fix For: 0.3
>
>         Attachments: colt.jar, mahout-165-trove.patch, MAHOUT-165-updated.patch, mahout-165.patch, MAHOUT-165.patch, mahout-165.patch
>
>
> In SparseVector, we need primitives hash map for index and values. The present implementation of this hash map is not as efficient as some of the other implementations in non-Apache projects. 
> In an experiment, I found that, for get/set operations, the primitive hash of  Colt performance an order of magnitude better than OrderedIntDoubleMapping. For iteration it is 2x slower, though. 
> Using Colt in Sparsevector improved performance of canopy generation. For an experimental dataset, the current implementation takes 50 minutes. Using Colt, reduces this duration to 19-20 minutes. That's 60% reduction in the delay. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12754147#action_12754147 ] 

Grant Ingersoll commented on MAHOUT-165:
----------------------------------------

I think we will want doubles, but perhaps we can have two implementations and people can pick.

> Using better primitives hash for sparse vector for performance gains
> --------------------------------------------------------------------
>
>                 Key: MAHOUT-165
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-165
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>             Fix For: 0.2
>
>         Attachments: colt.jar, mahout-165-trove.patch, MAHOUT-165.patch, mahout-165.patch
>
>
> In SparseVector, we need primitives hash map for index and values. The present implementation of this hash map is not as efficient as some of the other implementations in non-Apache projects. 
> In an experiment, I found that, for get/set operations, the primitive hash of  Colt performance an order of magnitude better than OrderedIntDoubleMapping. For iteration it is 2x slower, though. 
> Using Colt in Sparsevector improved performance of canopy generation. For an experimental dataset, the current implementation takes 50 minutes. Using Colt, reduces this duration to 19-20 minutes. That's 60% reduction in the delay. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

Posted by "Jake Mannix (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12760886#action_12760886 ] 

Jake Mannix commented on MAHOUT-165:
------------------------------------

One test which is failing is the basic VectorTest case which checks equals() - what are we considering the contract of equals() to be on vectors?  I would normally assume the functionality in AbstractVector.equivalent() should be what equals() returns, but is this not done so we can compare while ignoring the name?  Or is there some more important reason why we say that a DenseVector and a SparseVector which are the same "vector" in the mathematical sense are not returning equals() as true on each other?

Speaking of which, why do we have these static methods for "equivalent()" and "strictEquivalence"?  Do we need something different from "equals()" which is true mathematical equals (currently the functionality of equivalent())?

> Using better primitives hash for sparse vector for performance gains
> --------------------------------------------------------------------
>
>                 Key: MAHOUT-165
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-165
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>             Fix For: 0.2
>
>         Attachments: colt.jar, mahout-165-trove.patch, MAHOUT-165.patch, mahout-165.patch
>
>
> In SparseVector, we need primitives hash map for index and values. The present implementation of this hash map is not as efficient as some of the other implementations in non-Apache projects. 
> In an experiment, I found that, for get/set operations, the primitive hash of  Colt performance an order of magnitude better than OrderedIntDoubleMapping. For iteration it is 2x slower, though. 
> Using Colt in Sparsevector improved performance of canopy generation. For an experimental dataset, the current implementation takes 50 minutes. Using Colt, reduces this duration to 19-20 minutes. That's 60% reduction in the delay. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12757022#action_12757022 ] 

Sean Owen commented on MAHOUT-165:
----------------------------------

FWIW this is just what the FastIDSet and FastMap things I cobbled together do. Open addressing indeed means using a special value for empty versus formerly occupied. I think a separate array for these states is expensive -- just used, for example, a pointer to a dummy object instead as a special value, or null. I also included rehash() methods that could grow the array, or, merely reshuffle to free up the formerly-occupied slots.

> Using better primitives hash for sparse vector for performance gains
> --------------------------------------------------------------------
>
>                 Key: MAHOUT-165
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-165
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>             Fix For: 0.2
>
>         Attachments: colt.jar, mahout-165-trove.patch, MAHOUT-165.patch, mahout-165.patch
>
>
> In SparseVector, we need primitives hash map for index and values. The present implementation of this hash map is not as efficient as some of the other implementations in non-Apache projects. 
> In an experiment, I found that, for get/set operations, the primitive hash of  Colt performance an order of magnitude better than OrderedIntDoubleMapping. For iteration it is 2x slower, though. 
> Using Colt in Sparsevector improved performance of canopy generation. For an experimental dataset, the current implementation takes 50 minutes. Using Colt, reduces this duration to 19-20 minutes. That's 60% reduction in the delay. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

Posted by "Jake Mannix (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781171#action_12781171 ] 

Jake Mannix commented on MAHOUT-165:
------------------------------------

Ok, I agree with this - I'm just wrapping my head around the dependency tree.  So mahout-matrix contains Vector and Matrix (and all the libraries-formerly-known-as-colt), and core depends on it, and then lots of stuff depends on core and it, or just core and hence matrix by transitivity (this part I don't know how it's done in Maven).

Then this allows some modules to depend on mahout-matrix but *not* on core, which is nice, yes, it's much more reusable.


> Using better primitives hash for sparse vector for performance gains
> --------------------------------------------------------------------
>
>                 Key: MAHOUT-165
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-165
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>             Fix For: 0.3
>
>         Attachments: colt.jar, mahout-165-18nov-updated.patch, mahout-165-18nov.patch, MAHOUT-165-colt.patch, mahout-165-trove.patch, MAHOUT-165-updated.patch, MAHOUT-165-with-colt-module.patch, MAHOUT-165-with-colt.patch, mahout-165.patch, MAHOUT-165.patch, mahout-165.patch
>
>
> In SparseVector, we need primitives hash map for index and values. The present implementation of this hash map is not as efficient as some of the other implementations in non-Apache projects. 
> In an experiment, I found that, for get/set operations, the primitive hash of  Colt performance an order of magnitude better than OrderedIntDoubleMapping. For iteration it is 2x slower, though. 
> Using Colt in Sparsevector improved performance of canopy generation. For an experimental dataset, the current implementation takes 50 minutes. Using Colt, reduces this duration to 19-20 minutes. That's 60% reduction in the delay. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

Posted by "Jake Mannix (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12777082#action_12777082 ] 

Jake Mannix commented on MAHOUT-165:
------------------------------------

Ok then, let's try out Colt, unless we have a more permissive policy in here about MTJ than the c-math guys have: they didn't want MTJ because using it required either including a jar file of the output of f2j translations of some Fortran code... which is ok for us as long as it's apache-compatible, since we don't have the hard "no external dependencies" requirement that they have.  

What Shashi wrote before was this, when he attached the modified colt jar:

bq. Jar for Colt after removing the LGPL code of hep.aida and the the dependent classes. The classes in colt.matrix.* are removed as they require hep.aida.

I actually stripped the hep.aida.* dependencies out of even the colt.matrix.* classes in Colt on my local gitrepo, which keeps pretty much all of the functionality intact.  I can make an updated patch which has the full source code for that, so that we can include it instead of just having a jar.

Do we want to try comparing both MTJ and Colt?

Also: do we think our linear API is "complete" enough to solidify on as a wrapper for whatever is plugged in underneath?  Some of the changes which have been discussed in other tickets and on the list are

* pulling Writable off of the interface, so that not every impl is hooked into such a coupling to Hadoop, then wrapping it with a Writable wrapper / subclass to add that functionality
* the double aggregate(BinaryDoubleFunction aggregator, UnaryFunction map) and double aggregate(Vector other, BinaryDoubleFunction aggregator, BinaryDoubleFunction map) methods for abstracting away inner products and norms.  Not necessary, but very easily implemented in AbstractVector so that nobody needs to worry about these methods if they don't like programming that way.

> Using better primitives hash for sparse vector for performance gains
> --------------------------------------------------------------------
>
>                 Key: MAHOUT-165
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-165
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>             Fix For: 0.3
>
>         Attachments: colt.jar, mahout-165-trove.patch, MAHOUT-165-updated.patch, mahout-165.patch, MAHOUT-165.patch, mahout-165.patch
>
>
> In SparseVector, we need primitives hash map for index and values. The present implementation of this hash map is not as efficient as some of the other implementations in non-Apache projects. 
> In an experiment, I found that, for get/set operations, the primitive hash of  Colt performance an order of magnitude better than OrderedIntDoubleMapping. For iteration it is 2x slower, though. 
> Using Colt in Sparsevector improved performance of canopy generation. For an experimental dataset, the current implementation takes 50 minutes. Using Colt, reduces this duration to 19-20 minutes. That's 60% reduction in the delay. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12777089#action_12777089 ] 

Ted Dunning commented on MAHOUT-165:
------------------------------------


bq. pulling Writable off of the interface, so that not every impl is hooked into such a coupling to Hadoop, then wrapping it with a Writable wrapper / subclass to add that functionality

+1

Same thing should be done with row and column labels.

Not sure how to handle matrices of indefinite dimension which are probably important for some of what we do.  Perhaps just declare them as very, very large in a wrapper.

bq. the double aggregate(BinaryDoubleFunction aggregator, UnaryFunction map) and double aggregate(Vector other, BinaryDoubleFunction aggregator, BinaryDoubleFunction map) methods for abstracting away inner products and norms.  Not necessary, but very easily implemented in AbstractVector so that nobody needs to worry about these methods if they don't like programming that way.

These are very handy function.  Row and/or column aggregator functions are also important.

Colt gets a big boost in speed by testing in the implementation for special combinations of these functional constructs.  That lets it implement dot and sum with bespoke code and avoid the function call overhead (with associated risk of the JVM not in-lining enough).

Another big change is that Colt makes extensive use of view semantics.  I think that this is a really good idea, but it does differ a bit from what we have done so far.




> Using better primitives hash for sparse vector for performance gains
> --------------------------------------------------------------------
>
>                 Key: MAHOUT-165
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-165
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>             Fix For: 0.3
>
>         Attachments: colt.jar, mahout-165-trove.patch, MAHOUT-165-updated.patch, mahout-165.patch, MAHOUT-165.patch, mahout-165.patch
>
>
> In SparseVector, we need primitives hash map for index and values. The present implementation of this hash map is not as efficient as some of the other implementations in non-Apache projects. 
> In an experiment, I found that, for get/set operations, the primitive hash of  Colt performance an order of magnitude better than OrderedIntDoubleMapping. For iteration it is 2x slower, though. 
> Using Colt in Sparsevector improved performance of canopy generation. For an experimental dataset, the current implementation takes 50 minutes. Using Colt, reduces this duration to 19-20 minutes. That's 60% reduction in the delay. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

Posted by "Jake Mannix (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12760879#action_12760879 ] 

Jake Mannix commented on MAHOUT-165:
------------------------------------

Hey Ted, I tried bringing your patch up to current trunk, but while I could get it to compile (eventually!), it's got 11 failing tests for me.  Do you have an up-to-date copy of this patch, or should I just keep hammering away at this old one until all the tests pass? :)

> Using better primitives hash for sparse vector for performance gains
> --------------------------------------------------------------------
>
>                 Key: MAHOUT-165
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-165
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>             Fix For: 0.2
>
>         Attachments: colt.jar, mahout-165-trove.patch, MAHOUT-165.patch, mahout-165.patch
>
>
> In SparseVector, we need primitives hash map for index and values. The present implementation of this hash map is not as efficient as some of the other implementations in non-Apache projects. 
> In an experiment, I found that, for get/set operations, the primitive hash of  Colt performance an order of magnitude better than OrderedIntDoubleMapping. For iteration it is 2x slower, though. 
> Using Colt in Sparsevector improved performance of canopy generation. For an experimental dataset, the current implementation takes 50 minutes. Using Colt, reduces this duration to 19-20 minutes. That's 60% reduction in the delay. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12754154#action_12754154 ] 

Ted Dunning commented on MAHOUT-165:
------------------------------------


If we need something more than small integers, it is likely that we will need doubles.  Floats are just tooo easy to get into trouble with round-off with large numerical operations.  Even something as simple as adding up a list of numbers can be difficult without enough resolution.  Since we generally can't afford the time and effort to do detailed numerical stability analysis, we should probably stick with doubles for most floating point work.



> Using better primitives hash for sparse vector for performance gains
> --------------------------------------------------------------------
>
>                 Key: MAHOUT-165
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-165
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>             Fix For: 0.2
>
>         Attachments: colt.jar, mahout-165-trove.patch, MAHOUT-165.patch, mahout-165.patch
>
>
> In SparseVector, we need primitives hash map for index and values. The present implementation of this hash map is not as efficient as some of the other implementations in non-Apache projects. 
> In an experiment, I found that, for get/set operations, the primitive hash of  Colt performance an order of magnitude better than OrderedIntDoubleMapping. For iteration it is 2x slower, though. 
> Using Colt in Sparsevector improved performance of canopy generation. For an experimental dataset, the current implementation takes 50 minutes. Using Colt, reduces this duration to 19-20 minutes. That's 60% reduction in the delay. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

Posted by "Shashikant Kore (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12756526#action_12756526 ] 

Shashikant Kore commented on MAHOUT-165:
----------------------------------------

Since, I couldn't apply Ted's patch to trunk, I only tested the IntDoubleHash in isolation.  Performance-wise, it is as good as Colt.

But, there are other issues with the basic implementation.

1. The class exposes the internal index array instead of keys. The internal array may have empty slots (marked with value -1). This is not consistent with a typical hash implementation. The side effect is extra work by the callee to only check the keys greater than zero. 

2. The clone() method has a bug. Instead of copying the entire index & value array, it only copies the count of valid values in the map. 

3. We don't need right now, but there is no remove() method. 

Of course, all these are fixable issues.  But, if we again need something similar, Colt will prove to be of great help. 

> Using better primitives hash for sparse vector for performance gains
> --------------------------------------------------------------------
>
>                 Key: MAHOUT-165
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-165
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>             Fix For: 0.2
>
>         Attachments: colt.jar, mahout-165-trove.patch, MAHOUT-165.patch, mahout-165.patch
>
>
> In SparseVector, we need primitives hash map for index and values. The present implementation of this hash map is not as efficient as some of the other implementations in non-Apache projects. 
> In an experiment, I found that, for get/set operations, the primitive hash of  Colt performance an order of magnitude better than OrderedIntDoubleMapping. For iteration it is 2x slower, though. 
> Using Colt in Sparsevector improved performance of canopy generation. For an experimental dataset, the current implementation takes 50 minutes. Using Colt, reduces this duration to 19-20 minutes. That's 60% reduction in the delay. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

Posted by "Jake Mannix (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12779591#action_12779591 ] 

Jake Mannix commented on MAHOUT-165:
------------------------------------

bq. I think this deserves a bit more thought. I was actually surprised to hear iterateNonZero() doesn't iterate in order?? that is surprising. Can that be right?

I'm pretty sure this is in all of the hash-based implementations I've seen.  Nobody goes to the trouble to make it a linked hash or whatnot, because that slows down insertion (to ad a different complexity class!) and takes up more memory.

In my view, there's really two different sparse vector implementations: one primarily read-only and optimized for speed of iteration (and can guarantee ordering), and another which is optimized for random-access read-write actions.  The client needs to know which action they're going to be using most and make the choice of impl based on that (and if necessary, make a switch partway through: I often build up vectors using a map-based approach, then "seal" them into a faster read-only form if I don't need random-access reads done on them [which is the slow read-only action on int[] / double[] vectors])

> Using better primitives hash for sparse vector for performance gains
> --------------------------------------------------------------------
>
>                 Key: MAHOUT-165
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-165
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>             Fix For: 0.3
>
>         Attachments: colt.jar, mahout-165-18nov-updated.patch, mahout-165-18nov.patch, mahout-165-trove.patch, MAHOUT-165-updated.patch, MAHOUT-165-with-colt-module.patch, MAHOUT-165-with-colt.patch, mahout-165.patch, MAHOUT-165.patch, mahout-165.patch
>
>
> In SparseVector, we need primitives hash map for index and values. The present implementation of this hash map is not as efficient as some of the other implementations in non-Apache projects. 
> In an experiment, I found that, for get/set operations, the primitive hash of  Colt performance an order of magnitude better than OrderedIntDoubleMapping. For iteration it is 2x slower, though. 
> Using Colt in Sparsevector improved performance of canopy generation. For an experimental dataset, the current implementation takes 50 minutes. Using Colt, reduces this duration to 19-20 minutes. That's 60% reduction in the delay. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12779642#action_12779642 ] 

Sean Owen commented on MAHOUT-165:
----------------------------------

Fair enough. Then my comment is just that the hash code can be defined such that it doesn't need elements in order to be computed, so shouldn't.

> Using better primitives hash for sparse vector for performance gains
> --------------------------------------------------------------------
>
>                 Key: MAHOUT-165
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-165
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>             Fix For: 0.3
>
>         Attachments: colt.jar, mahout-165-18nov-updated.patch, mahout-165-18nov.patch, mahout-165-trove.patch, MAHOUT-165-updated.patch, MAHOUT-165-with-colt-module.patch, MAHOUT-165-with-colt.patch, mahout-165.patch, MAHOUT-165.patch, mahout-165.patch
>
>
> In SparseVector, we need primitives hash map for index and values. The present implementation of this hash map is not as efficient as some of the other implementations in non-Apache projects. 
> In an experiment, I found that, for get/set operations, the primitive hash of  Colt performance an order of magnitude better than OrderedIntDoubleMapping. For iteration it is 2x slower, though. 
> Using Colt in Sparsevector improved performance of canopy generation. For an experimental dataset, the current implementation takes 50 minutes. Using Colt, reduces this duration to 19-20 minutes. That's 60% reduction in the delay. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789721#action_12789721 ] 

Sean Owen commented on MAHOUT-165:
----------------------------------

So this is done now right? SparseVector uses OpenIntDoubleMap now.

> Using better primitives hash for sparse vector for performance gains
> --------------------------------------------------------------------
>
>                 Key: MAHOUT-165
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-165
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>             Fix For: 0.3
>
>         Attachments: colt.jar, mahout-165-18nov-updated.patch, mahout-165-18nov.patch, MAHOUT-165-colt.patch, mahout-165-trove.patch, MAHOUT-165-updated.patch, MAHOUT-165-with-colt-module.patch, MAHOUT-165-with-colt.patch, mahout-165.patch, MAHOUT-165.patch, mahout-165.patch
>
>
> In SparseVector, we need primitives hash map for index and values. The present implementation of this hash map is not as efficient as some of the other implementations in non-Apache projects. 
> In an experiment, I found that, for get/set operations, the primitive hash of  Colt performance an order of magnitude better than OrderedIntDoubleMapping. For iteration it is 2x slower, though. 
> Using Colt in Sparsevector improved performance of canopy generation. For an experimental dataset, the current implementation takes 50 minutes. Using Colt, reduces this duration to 19-20 minutes. That's 60% reduction in the delay. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

Posted by "Shashikant Kore (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shashikant Kore updated MAHOUT-165:
-----------------------------------

    Attachment: mahout-165-18nov-updated.patch

I am updating the patch to ensure hashCode() is evaluated correctly. Since hashCode() needs elements in sorted order and many other routines are OK with any order, I have added an extra method to return iterator with sorted elements. Only SparseVector handles the sorting flag. 

This is built on Drew's patch. So, the namespaces are changed from colt to  org.apache.mahout.colt.* 

> Using better primitives hash for sparse vector for performance gains
> --------------------------------------------------------------------
>
>                 Key: MAHOUT-165
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-165
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>             Fix For: 0.3
>
>         Attachments: colt.jar, mahout-165-18nov-updated.patch, mahout-165-18nov.patch, mahout-165-trove.patch, MAHOUT-165-updated.patch, MAHOUT-165-with-colt-module.patch, MAHOUT-165-with-colt.patch, mahout-165.patch, MAHOUT-165.patch, mahout-165.patch
>
>
> In SparseVector, we need primitives hash map for index and values. The present implementation of this hash map is not as efficient as some of the other implementations in non-Apache projects. 
> In an experiment, I found that, for get/set operations, the primitive hash of  Colt performance an order of magnitude better than OrderedIntDoubleMapping. For iteration it is 2x slower, though. 
> Using Colt in Sparsevector improved performance of canopy generation. For an experimental dataset, the current implementation takes 50 minutes. Using Colt, reduces this duration to 19-20 minutes. That's 60% reduction in the delay. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

Posted by "Otis Gospodnetic (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12779116#action_12779116 ] 

Otis Gospodnetic commented on MAHOUT-165:
-----------------------------------------

Yes, Wolfgang contributed MemoryIndex.


> Using better primitives hash for sparse vector for performance gains
> --------------------------------------------------------------------
>
>                 Key: MAHOUT-165
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-165
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>             Fix For: 0.3
>
>         Attachments: colt.jar, mahout-165-trove.patch, MAHOUT-165-updated.patch, mahout-165.patch, MAHOUT-165.patch, mahout-165.patch
>
>
> In SparseVector, we need primitives hash map for index and values. The present implementation of this hash map is not as efficient as some of the other implementations in non-Apache projects. 
> In an experiment, I found that, for get/set operations, the primitive hash of  Colt performance an order of magnitude better than OrderedIntDoubleMapping. For iteration it is 2x slower, though. 
> Using Colt in Sparsevector improved performance of canopy generation. For an experimental dataset, the current implementation takes 50 minutes. Using Colt, reduces this duration to 19-20 minutes. That's 60% reduction in the delay. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

Posted by "Jake Mannix (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12779651#action_12779651 ] 

Jake Mannix commented on MAHOUT-165:
------------------------------------

bq. Then my comment is just that the hash code can be defined such that it doesn't need elements in order to be computed, so shouldn't.

Totally agree. 

> Using better primitives hash for sparse vector for performance gains
> --------------------------------------------------------------------
>
>                 Key: MAHOUT-165
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-165
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>             Fix For: 0.3
>
>         Attachments: colt.jar, mahout-165-18nov-updated.patch, mahout-165-18nov.patch, mahout-165-trove.patch, MAHOUT-165-updated.patch, MAHOUT-165-with-colt-module.patch, MAHOUT-165-with-colt.patch, mahout-165.patch, MAHOUT-165.patch, mahout-165.patch
>
>
> In SparseVector, we need primitives hash map for index and values. The present implementation of this hash map is not as efficient as some of the other implementations in non-Apache projects. 
> In an experiment, I found that, for get/set operations, the primitive hash of  Colt performance an order of magnitude better than OrderedIntDoubleMapping. For iteration it is 2x slower, though. 
> Using Colt in Sparsevector improved performance of canopy generation. For an experimental dataset, the current implementation takes 50 minutes. Using Colt, reduces this duration to 19-20 minutes. That's 60% reduction in the delay. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12751534#action_12751534 ] 

Sean Owen commented on MAHOUT-165:
----------------------------------

I suggest we distribute in binary form. Just unpack the jar, delete the bits we can't distribute, and repack it.

Yes it's LGPL but as I understand even that is problematic?

> Using better primitives hash for sparse vector for performance gains
> --------------------------------------------------------------------
>
>                 Key: MAHOUT-165
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-165
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>             Fix For: 0.2
>
>         Attachments: mahout-165-trove.patch, mahout-165.patch
>
>
> In SparseVector, we need primitives hash map for index and values. The present implementation of this hash map is not as efficient as some of the other implementations in non-Apache projects. 
> In an experiment, I found that, for get/set operations, the primitive hash of  Colt performance an order of magnitude better than OrderedIntDoubleMapping. For iteration it is 2x slower, though. 
> Using Colt in Sparsevector improved performance of canopy generation. For an experimental dataset, the current implementation takes 50 minutes. Using Colt, reduces this duration to 19-20 minutes. That's 60% reduction in the delay. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12779496#action_12779496 ] 

Sean Owen commented on MAHOUT-165:
----------------------------------

I think this deserves a bit more thought. I was actually surprised to hear iterateNonZero() doesn't iterate in order?? that is surprising. Can that be right? The argument I suppose is that some callers don't need ordering so why provide it, for efficiency reasons. If that's the thinking, then hashCode() needs to simply take account of the index, not order the elements. A consistent/correct hash code doesn't necessarily need this.

> Using better primitives hash for sparse vector for performance gains
> --------------------------------------------------------------------
>
>                 Key: MAHOUT-165
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-165
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>             Fix For: 0.3
>
>         Attachments: colt.jar, mahout-165-18nov-updated.patch, mahout-165-18nov.patch, mahout-165-trove.patch, MAHOUT-165-updated.patch, MAHOUT-165-with-colt-module.patch, MAHOUT-165-with-colt.patch, mahout-165.patch, MAHOUT-165.patch, mahout-165.patch
>
>
> In SparseVector, we need primitives hash map for index and values. The present implementation of this hash map is not as efficient as some of the other implementations in non-Apache projects. 
> In an experiment, I found that, for get/set operations, the primitive hash of  Colt performance an order of magnitude better than OrderedIntDoubleMapping. For iteration it is 2x slower, though. 
> Using Colt in Sparsevector improved performance of canopy generation. For an experimental dataset, the current implementation takes 50 minutes. Using Colt, reduces this duration to 19-20 minutes. That's 60% reduction in the delay. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781427#action_12781427 ] 

Grant Ingersoll commented on MAHOUT-165:
----------------------------------------

OK, I am committing the Matrix module.  Once done, I am going to move our Matrix stuff out of core and into the Matrix module.  Then, Shashi, if you can update your patch, that would be great.  From there, refactoring Vector to not have a Writable dependency (etc.) would be great, but let's handle that on a separate issue.

> Using better primitives hash for sparse vector for performance gains
> --------------------------------------------------------------------
>
>                 Key: MAHOUT-165
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-165
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>             Fix For: 0.3
>
>         Attachments: colt.jar, mahout-165-18nov-updated.patch, mahout-165-18nov.patch, MAHOUT-165-colt.patch, mahout-165-trove.patch, MAHOUT-165-updated.patch, MAHOUT-165-with-colt-module.patch, MAHOUT-165-with-colt.patch, mahout-165.patch, MAHOUT-165.patch, mahout-165.patch
>
>
> In SparseVector, we need primitives hash map for index and values. The present implementation of this hash map is not as efficient as some of the other implementations in non-Apache projects. 
> In an experiment, I found that, for get/set operations, the primitive hash of  Colt performance an order of magnitude better than OrderedIntDoubleMapping. For iteration it is 2x slower, though. 
> Using Colt in Sparsevector improved performance of canopy generation. For an experimental dataset, the current implementation takes 50 minutes. Using Colt, reduces this duration to 19-20 minutes. That's 60% reduction in the delay. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Resolved: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

Posted by "Jake Mannix (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jake Mannix resolved MAHOUT-165.
--------------------------------

    Resolution: Fixed

I think we can finally close this monster.  Work continues in MAHOUT-204, MAHOUT-205, and MAHOUT-206, among other places.

> Using better primitives hash for sparse vector for performance gains
> --------------------------------------------------------------------
>
>                 Key: MAHOUT-165
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-165
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>             Fix For: 0.3
>
>         Attachments: colt.jar, mahout-165-18nov-updated.patch, mahout-165-18nov.patch, MAHOUT-165-colt.patch, mahout-165-trove.patch, MAHOUT-165-updated.patch, MAHOUT-165-with-colt-module.patch, MAHOUT-165-with-colt.patch, mahout-165.patch, MAHOUT-165.patch, mahout-165.patch
>
>
> In SparseVector, we need primitives hash map for index and values. The present implementation of this hash map is not as efficient as some of the other implementations in non-Apache projects. 
> In an experiment, I found that, for get/set operations, the primitive hash of  Colt performance an order of magnitude better than OrderedIntDoubleMapping. For iteration it is 2x slower, though. 
> Using Colt in Sparsevector improved performance of canopy generation. For an experimental dataset, the current implementation takes 50 minutes. Using Colt, reduces this duration to 19-20 minutes. That's 60% reduction in the delay. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll updated MAHOUT-165:
-----------------------------------

    Attachment: MAHOUT-165-colt.patch

The Colt stuff looks good, my only concern, legally, is the name, oddly enough.  I don't think we should call it Colt.  AFAICT, that name is owned by CERN and while the license allows us to bring over the code, it doesn't give us rights to the name.

This patch changes the name to matrix, adds the appropriate legal bits to NOTICE.txt and LICENSE.txt

This just covers the Colt stuff, it does not apply Shashi's patch.  

It seems like we should just move our Matrix (currently in core) out to this package and have core have a dependency on this module.

> Using better primitives hash for sparse vector for performance gains
> --------------------------------------------------------------------
>
>                 Key: MAHOUT-165
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-165
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>             Fix For: 0.3
>
>         Attachments: colt.jar, mahout-165-18nov-updated.patch, mahout-165-18nov.patch, MAHOUT-165-colt.patch, mahout-165-trove.patch, MAHOUT-165-updated.patch, MAHOUT-165-with-colt-module.patch, MAHOUT-165-with-colt.patch, mahout-165.patch, MAHOUT-165.patch, mahout-165.patch
>
>
> In SparseVector, we need primitives hash map for index and values. The present implementation of this hash map is not as efficient as some of the other implementations in non-Apache projects. 
> In an experiment, I found that, for get/set operations, the primitive hash of  Colt performance an order of magnitude better than OrderedIntDoubleMapping. For iteration it is 2x slower, though. 
> Using Colt in Sparsevector improved performance of canopy generation. For an experimental dataset, the current implementation takes 50 minutes. Using Colt, reduces this duration to 19-20 minutes. That's 60% reduction in the delay. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12756790#action_12756790 ] 

Ted Dunning commented on MAHOUT-165:
------------------------------------

bq. 3. We don't need right now, but there is no remove() method. 

remove will be a PITA to get right.  The problem is with collisions and the double hashing.  When you remove something, you don't know what other keys may have collided with what you are removing.  That means that you need to leave a marker behind so that other searches will still view that slot as occupied.  Repeated insert/remove/insert will ultimately cause the array to resize itself.  

I would propose extending the current empty index mark (-1) to include a formerly occupied mark (-2).  Then the scanning would have to be clever enough to treat empty and formerly occupied differently.



> Using better primitives hash for sparse vector for performance gains
> --------------------------------------------------------------------
>
>                 Key: MAHOUT-165
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-165
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>             Fix For: 0.2
>
>         Attachments: colt.jar, mahout-165-trove.patch, MAHOUT-165.patch, mahout-165.patch
>
>
> In SparseVector, we need primitives hash map for index and values. The present implementation of this hash map is not as efficient as some of the other implementations in non-Apache projects. 
> In an experiment, I found that, for get/set operations, the primitive hash of  Colt performance an order of magnitude better than OrderedIntDoubleMapping. For iteration it is 2x slower, though. 
> Using Colt in Sparsevector improved performance of canopy generation. For an experimental dataset, the current implementation takes 50 minutes. Using Colt, reduces this duration to 19-20 minutes. That's 60% reduction in the delay. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

Posted by "Shashikant Kore (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12745405#action_12745405 ] 

Shashikant Kore commented on MAHOUT-165:
----------------------------------------

I couldn't locate the primitive hasmap in MTJ. Or is it the case that we will be directly using the vector class from MTJ? 

I checked some discussion on Commons Math list about Colt. But I am not sure about the conclusion of that conversation. 


> Using better primitives hash for sparse vector for performance gains
> --------------------------------------------------------------------
>
>                 Key: MAHOUT-165
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-165
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>             Fix For: 0.2
>
>         Attachments: mahout-165.patch
>
>
> In SparseVector, we need primitives hash map for index and values. The present implementation of this hash map is not as efficient as some of the other implementations in non-Apache projects. 
> In an experiment, I found that, for get/set operations, the primitive hash of  Colt performance an order of magnitude better than OrderedIntDoubleMapping. For iteration it is 2x slower, though. 
> Using Colt in Sparsevector improved performance of canopy generation. For an experimental dataset, the current implementation takes 50 minutes. Using Colt, reduces this duration to 19-20 minutes. That's 60% reduction in the delay. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

Posted by "Jake Mannix (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12778971#action_12778971 ] 

Jake Mannix commented on MAHOUT-165:
------------------------------------

Isn't colt abandoned?  I'm not in favor of cracking open libraries either, but I was under the impression that colt has not been maintained or modified since 2004.

We really shouldn't include a binary jar for an abandoned distribution which people have to hunt down the source for, right?

> Using better primitives hash for sparse vector for performance gains
> --------------------------------------------------------------------
>
>                 Key: MAHOUT-165
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-165
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>             Fix For: 0.3
>
>         Attachments: colt.jar, mahout-165-trove.patch, MAHOUT-165-updated.patch, mahout-165.patch, MAHOUT-165.patch, mahout-165.patch
>
>
> In SparseVector, we need primitives hash map for index and values. The present implementation of this hash map is not as efficient as some of the other implementations in non-Apache projects. 
> In an experiment, I found that, for get/set operations, the primitive hash of  Colt performance an order of magnitude better than OrderedIntDoubleMapping. For iteration it is 2x slower, though. 
> Using Colt in Sparsevector improved performance of canopy generation. For an experimental dataset, the current implementation takes 50 minutes. Using Colt, reduces this duration to 19-20 minutes. That's 60% reduction in the delay. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12766231#action_12766231 ] 

Grant Ingersoll commented on MAHOUT-165:
----------------------------------------

Shashi's vectors are at:  http://people.apache.org/~gsingers/mahout/vectors-test-mahout.gz.

> Using better primitives hash for sparse vector for performance gains
> --------------------------------------------------------------------
>
>                 Key: MAHOUT-165
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-165
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>             Fix For: 0.2
>
>         Attachments: colt.jar, mahout-165-trove.patch, MAHOUT-165-updated.patch, mahout-165.patch, MAHOUT-165.patch, mahout-165.patch
>
>
> In SparseVector, we need primitives hash map for index and values. The present implementation of this hash map is not as efficient as some of the other implementations in non-Apache projects. 
> In an experiment, I found that, for get/set operations, the primitive hash of  Colt performance an order of magnitude better than OrderedIntDoubleMapping. For iteration it is 2x slower, though. 
> Using Colt in Sparsevector improved performance of canopy generation. For an experimental dataset, the current implementation takes 50 minutes. Using Colt, reduces this duration to 19-20 minutes. That's 60% reduction in the delay. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12779411#action_12779411 ] 

Sean Owen commented on MAHOUT-165:
----------------------------------

+1 to making this a Mahout module. There's not much difference in making this a separate project that's Mahout-related, other than that is more work. mahout-colt shouldn't depend on mahout-core or anything, so it can still be used independently.

I say commit this (with proper license attribution at the moment) so we can move forward!

Sounds like Shashi's patch should then follow, using mahout-colt.

The unit test failures sound like either an issue in the patch, or in the Vector.hashCode() implementation. Vector elements are ordered, so equals() must pay attention to order, therefore so must hashCode(). If that's not true it's a bug. If it's a bug, fix it in your patch.

With these two, can we finally commit and close this monster issue?

> Using better primitives hash for sparse vector for performance gains
> --------------------------------------------------------------------
>
>                 Key: MAHOUT-165
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-165
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>             Fix For: 0.3
>
>         Attachments: colt.jar, mahout-165-18nov.patch, mahout-165-trove.patch, MAHOUT-165-updated.patch, MAHOUT-165-with-colt-module.patch, MAHOUT-165-with-colt.patch, mahout-165.patch, MAHOUT-165.patch, mahout-165.patch
>
>
> In SparseVector, we need primitives hash map for index and values. The present implementation of this hash map is not as efficient as some of the other implementations in non-Apache projects. 
> In an experiment, I found that, for get/set operations, the primitive hash of  Colt performance an order of magnitude better than OrderedIntDoubleMapping. For iteration it is 2x slower, though. 
> Using Colt in Sparsevector improved performance of canopy generation. For an experimental dataset, the current implementation takes 50 minutes. Using Colt, reduces this duration to 19-20 minutes. That's 60% reduction in the delay. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12776924#action_12776924 ] 

Sean Owen commented on MAHOUT-165:
----------------------------------

IntDoubleHash right? We could look at that, but I thought the status here was that Colt worked just fine and fast. Perhaps I miss something but I don't see a remaining issue with using (part of) Colt.

I somehow strongly suspect we will benefit from not reinventing a wheel here, and whatever we need can be done with Colt, plus perhaps some contributed changes, plus a custom implementation here and there.

+1 for Whatever Is Needed To Use Colt?

> Using better primitives hash for sparse vector for performance gains
> --------------------------------------------------------------------
>
>                 Key: MAHOUT-165
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-165
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>             Fix For: 0.3
>
>         Attachments: colt.jar, mahout-165-trove.patch, MAHOUT-165-updated.patch, mahout-165.patch, MAHOUT-165.patch, mahout-165.patch
>
>
> In SparseVector, we need primitives hash map for index and values. The present implementation of this hash map is not as efficient as some of the other implementations in non-Apache projects. 
> In an experiment, I found that, for get/set operations, the primitive hash of  Colt performance an order of magnitude better than OrderedIntDoubleMapping. For iteration it is 2x slower, though. 
> Using Colt in Sparsevector improved performance of canopy generation. For an experimental dataset, the current implementation takes 50 minutes. Using Colt, reduces this duration to 19-20 minutes. That's 60% reduction in the delay. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12779134#action_12779134 ] 

Ted Dunning commented on MAHOUT-165:
------------------------------------

bq. +1 sounds like a good idea to me. It's just a matter of adding "@deprecated no unit tests.." tag to each class, no?

Correct.

bq. Regarding @deprecated: sounds a little aggressive to me, but not against it. 

I would very much like to do it so that we have some way of keeping track which parts have been validated and tested.

bq. .... Perhaps Commons Math is interested in it, too.

Commons math has been pretty aggressively uninterested in contributions lately.  I have been involved in some patches to make distributions and sampling more usable and much more widely available.  Jake was recently trying to help get sparse matrices to a usable state.  The result was lots of API whining and complete loss of momentum for improvement.  My own opinion is that it is not practical to contribute anything more than completely trivial items to commons-math and I say that we should go forward and not wait for them.


> Using better primitives hash for sparse vector for performance gains
> --------------------------------------------------------------------
>
>                 Key: MAHOUT-165
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-165
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>             Fix For: 0.3
>
>         Attachments: colt.jar, mahout-165-trove.patch, MAHOUT-165-updated.patch, mahout-165.patch, MAHOUT-165.patch, mahout-165.patch
>
>
> In SparseVector, we need primitives hash map for index and values. The present implementation of this hash map is not as efficient as some of the other implementations in non-Apache projects. 
> In an experiment, I found that, for get/set operations, the primitive hash of  Colt performance an order of magnitude better than OrderedIntDoubleMapping. For iteration it is 2x slower, though. 
> Using Colt in Sparsevector improved performance of canopy generation. For an experimental dataset, the current implementation takes 50 minutes. Using Colt, reduces this duration to 19-20 minutes. That's 60% reduction in the delay. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

Posted by "Jake Mannix (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12779335#action_12779335 ] 

Jake Mannix commented on MAHOUT-165:
------------------------------------

Awesome, thanks Drew.  I noticed you didn't add a {code}<module>colt</module>{code} line inside of the top level pom, is this to hide it from being depended on?  I just ask because it meant that IntelliJ didn't seem to want to consider the mahout-colt submodule to be a real maven submodule without that there.

So what are the next steps here?  If we drop this into Mahout now, we can take advantage of it very quickly, which would be great.  Having it be it's own maven submodule means that it should be fairly easy to pull it out of Mahout entirely later, if that is the desire, right?   Grant, what are your thoughts on this?

I would love to see this package up on google-code/sourceforge/github, because then other projects could use it as well, without having to depend on Mahout.  I just don't want to slow down the process of solidifying our linear primitives here in Mahout-land.  The sooner we do that, the better.

> Using better primitives hash for sparse vector for performance gains
> --------------------------------------------------------------------
>
>                 Key: MAHOUT-165
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-165
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>             Fix For: 0.3
>
>         Attachments: colt.jar, mahout-165-trove.patch, MAHOUT-165-updated.patch, MAHOUT-165-with-colt-module.patch, MAHOUT-165-with-colt.patch, mahout-165.patch, MAHOUT-165.patch, mahout-165.patch
>
>
> In SparseVector, we need primitives hash map for index and values. The present implementation of this hash map is not as efficient as some of the other implementations in non-Apache projects. 
> In an experiment, I found that, for get/set operations, the primitive hash of  Colt performance an order of magnitude better than OrderedIntDoubleMapping. For iteration it is 2x slower, though. 
> Using Colt in Sparsevector improved performance of canopy generation. For an experimental dataset, the current implementation takes 50 minutes. Using Colt, reduces this duration to 19-20 minutes. That's 60% reduction in the delay. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12760898#action_12760898 ] 

Grant Ingersoll commented on MAHOUT-165:
----------------------------------------

There are some thoughts on equals, etc. in the archives and other JIRA issues. 

Here's what I recall:

1. We want DenseVectors and SparseVectors with the same names to be equal in the equals() sense.  The implementations of equals in SparseVector and DenseVector are the equivalent, AFAICT, as the implementations in equals(), but for #2:
2. We don't just defer to strictEquivalence b/c the thinking is that we can do much faster equals comparison if we know what type of vector it is, which is why SparseVector checks to see if "that" is a SparseVector, otherwise deferring to equivalent (since names have already been checked).  I haven't validated whether they truly are faster.


> Using better primitives hash for sparse vector for performance gains
> --------------------------------------------------------------------
>
>                 Key: MAHOUT-165
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-165
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>             Fix For: 0.2
>
>         Attachments: colt.jar, mahout-165-trove.patch, MAHOUT-165.patch, mahout-165.patch
>
>
> In SparseVector, we need primitives hash map for index and values. The present implementation of this hash map is not as efficient as some of the other implementations in non-Apache projects. 
> In an experiment, I found that, for get/set operations, the primitive hash of  Colt performance an order of magnitude better than OrderedIntDoubleMapping. For iteration it is 2x slower, though. 
> Using Colt in Sparsevector improved performance of canopy generation. For an experimental dataset, the current implementation takes 50 minutes. Using Colt, reduces this duration to 19-20 minutes. That's 60% reduction in the delay. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781467#action_12781467 ] 

Grant Ingersoll commented on MAHOUT-165:
----------------------------------------

OK, I committed Shashi's patch and fixed the colt package name remnants.

> Using better primitives hash for sparse vector for performance gains
> --------------------------------------------------------------------
>
>                 Key: MAHOUT-165
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-165
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>             Fix For: 0.3
>
>         Attachments: colt.jar, mahout-165-18nov-updated.patch, mahout-165-18nov.patch, MAHOUT-165-colt.patch, mahout-165-trove.patch, MAHOUT-165-updated.patch, MAHOUT-165-with-colt-module.patch, MAHOUT-165-with-colt.patch, mahout-165.patch, MAHOUT-165.patch, mahout-165.patch
>
>
> In SparseVector, we need primitives hash map for index and values. The present implementation of this hash map is not as efficient as some of the other implementations in non-Apache projects. 
> In an experiment, I found that, for get/set operations, the primitive hash of  Colt performance an order of magnitude better than OrderedIntDoubleMapping. For iteration it is 2x slower, though. 
> Using Colt in Sparsevector improved performance of canopy generation. For an experimental dataset, the current implementation takes 50 minutes. Using Colt, reduces this duration to 19-20 minutes. That's 60% reduction in the delay. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

Posted by "Jake Mannix (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12779050#action_12779050 ] 

Jake Mannix commented on MAHOUT-165:
------------------------------------

bq. The colt tree could also be put into a separate module that lives alongside core, util, examples, built independently as a part of the maven build - optionally at first, activated via a build profile.

+1 - I like this.

bq. As far as package names, would it be better to map cern.colt.* to org.apache.mahout.colt.* ? - that way there's no potential for the old being confused for the new in builds, etc.

I personally think this is the way to go, but does it reduce confusion, or increase it?  People who are used to using colt will see familiar classes, but in strange places.  If we're really going to overhaul the whole library over time, this makes sense, I guess.

bq. Would the cern.jet.* libraries be included as well?

In the vivisection I've performed (locally) on the last updated version of Colt, the colt.jet.* packages were able to be preserved without running into any licensing or dependency problems, so I've kept them, but they do duplicate some work we already have: there's a ton of random distributions, and stats for computing quantiles, a MersenneTwister impl, etc.

We could include them at first, and then do some perf testing / api twiddling over time to see which impls we want to keep where there are duplicates?

The only parts of colt which are removed are hep.aida.* and corejava.* (the latter is LGPL, but is not needed).  At the top level, what's left are cern.colt, cern.jet, and cern.clhep, but the latter can be removed also, because I'm pretty sure Mahout doesn't need to know the double value of Planck's constant (besides, as a former theorist, on principle I should note that the value of hbar is definitively (double)1, with no units, in natural units).

> Using better primitives hash for sparse vector for performance gains
> --------------------------------------------------------------------
>
>                 Key: MAHOUT-165
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-165
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>             Fix For: 0.3
>
>         Attachments: colt.jar, mahout-165-trove.patch, MAHOUT-165-updated.patch, mahout-165.patch, MAHOUT-165.patch, mahout-165.patch
>
>
> In SparseVector, we need primitives hash map for index and values. The present implementation of this hash map is not as efficient as some of the other implementations in non-Apache projects. 
> In an experiment, I found that, for get/set operations, the primitive hash of  Colt performance an order of magnitude better than OrderedIntDoubleMapping. For iteration it is 2x slower, though. 
> Using Colt in Sparsevector improved performance of canopy generation. For an experimental dataset, the current implementation takes 50 minutes. Using Colt, reduces this duration to 19-20 minutes. That's 60% reduction in the delay. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12751519#action_12751519 ] 

Ted Dunning commented on MAHOUT-165:
------------------------------------


To amplify on Grant's reply:

Sean is correct.  Trove is GPL and cannot be used or copied.  It is fine to do experiments with it.

I need to post my patch.  It isn't usable as is, but with a little help, it could solve the problem.

> Using better primitives hash for sparse vector for performance gains
> --------------------------------------------------------------------
>
>                 Key: MAHOUT-165
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-165
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>             Fix For: 0.2
>
>         Attachments: mahout-165-trove.patch, mahout-165.patch
>
>
> In SparseVector, we need primitives hash map for index and values. The present implementation of this hash map is not as efficient as some of the other implementations in non-Apache projects. 
> In an experiment, I found that, for get/set operations, the primitive hash of  Colt performance an order of magnitude better than OrderedIntDoubleMapping. For iteration it is 2x slower, though. 
> Using Colt in Sparsevector improved performance of canopy generation. For an experimental dataset, the current implementation takes 50 minutes. Using Colt, reduces this duration to 19-20 minutes. That's 60% reduction in the delay. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

Posted by "Drew Farris (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781153#action_12781153 ] 

Drew Farris commented on MAHOUT-165:
------------------------------------

bq. The Colt stuff looks good, my only concern, legally, is the name, oddly enough. I don't think we should call it Colt. 

+1 for mahout-matrix

bq. It seems like we should just move our Matrix (currently in core) out to this package and have core have a dependency on this module.

+1 yes the order of dependency makes the most sense; also makes it easy for others to reuse who are interested in only the matrix library.

> Using better primitives hash for sparse vector for performance gains
> --------------------------------------------------------------------
>
>                 Key: MAHOUT-165
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-165
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>             Fix For: 0.3
>
>         Attachments: colt.jar, mahout-165-18nov-updated.patch, mahout-165-18nov.patch, MAHOUT-165-colt.patch, mahout-165-trove.patch, MAHOUT-165-updated.patch, MAHOUT-165-with-colt-module.patch, MAHOUT-165-with-colt.patch, mahout-165.patch, MAHOUT-165.patch, mahout-165.patch
>
>
> In SparseVector, we need primitives hash map for index and values. The present implementation of this hash map is not as efficient as some of the other implementations in non-Apache projects. 
> In an experiment, I found that, for get/set operations, the primitive hash of  Colt performance an order of magnitude better than OrderedIntDoubleMapping. For iteration it is 2x slower, though. 
> Using Colt in Sparsevector improved performance of canopy generation. For an experimental dataset, the current implementation takes 50 minutes. Using Colt, reduces this duration to 19-20 minutes. That's 60% reduction in the delay. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

Posted by "Drew Farris (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12779039#action_12779039 ] 

Drew Farris commented on MAHOUT-165:
------------------------------------

The colt tree could also be put into a separate module that lives alongside core, util, examples, built independently as a part of the maven build -- optionally at first, activated via a build profile.  As some point in the future, once it is ready for prime time, it is added to the regular build and core would have a dependency on it of course.

As far as package names, would it be better to map cern.colt.* to org.apache.mahout.colt.* ? -- that way there's no potential for the old being confused for the new in builds, etc.

Would the cern.jet.* libraries be included as well?


> Using better primitives hash for sparse vector for performance gains
> --------------------------------------------------------------------
>
>                 Key: MAHOUT-165
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-165
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>             Fix For: 0.3
>
>         Attachments: colt.jar, mahout-165-trove.patch, MAHOUT-165-updated.patch, mahout-165.patch, MAHOUT-165.patch, mahout-165.patch
>
>
> In SparseVector, we need primitives hash map for index and values. The present implementation of this hash map is not as efficient as some of the other implementations in non-Apache projects. 
> In an experiment, I found that, for get/set operations, the primitive hash of  Colt performance an order of magnitude better than OrderedIntDoubleMapping. For iteration it is 2x slower, though. 
> Using Colt in Sparsevector improved performance of canopy generation. For an experimental dataset, the current implementation takes 50 minutes. Using Colt, reduces this duration to 19-20 minutes. That's 60% reduction in the delay. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.