You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@mahout.apache.org by Ted Dunning <te...@gmail.com> on 2009/08/31 08:55:21 UTC

issues found

I just did an exercise of implementing a faster sparse vector.  In the
process, I uncovered a bunch of warts.  I will be filing some Jiras and
patches as soon as I can get to them, but here is a preview:

a) most serialization code for vectors would write vectors with null names
out as if they had names of "".  This causes grief in tests and seems wrong.

b) the Vector/SparseVector hierarchy was oddly split out.  I added a
HashVector and moved the current SparseVector into IntDoubleMappingVector
with SparseVector remaining as an abstract class.  This unfortunately caused
lots of upheaval even up into the Vector class.  I have yet to sort this out
cleanly.

c) the squared distance functions were defined in multiple places.  I
centralized these into SquaredEuclideanDistance as a static function.  These
definitions also were wrong and would ignore any components that had
opposite sign and equal magnitude.  In fixing this, I went ahead and wrote
an implementation that makes use of any sparsity present.  To do that, I
added a method to Vector so that I could tell if a vector is sparse.  This
subsumes all of the distance optimizations that we have discussed.  It also
makes it very easy for new code to use these optimizations without knowing
about them.

d) the nonZeroIterator had to be substantially refactored to work in an
abstract class without knowing to much about the internals of everything.

e) in order ot make sure that all metrics worked reasonably with all types,
I have heavily refactored the testing structure for metrics.  I will be
doing more of this as well.  The goal is to test all vector types against
all distance metrics to make sure that they give the same results as
DenseVector.  Then, I will build tests for DenseVector to verify that it
produces the correct result.  This is important because so many vectors have
special code to help metric computation.

f) I have uncovered some strangeness in the ARFF I/O code that I introduced
with the SparseVector abstraction.  The old code will work as it did, but it
won't understand the new HashVector.


Soo....

I will be trying to break this down into as small a pieces as I can, but the
total will be a bunch of interdependent patches.  If anybody can help me
apply these as quickly as possible, we should have minimal problems with it
all.  If it drags out, it will get hard to keep rebasing the patch sequence
all the time.



-- 
Ted Dunning, CTO
DeepDyve

Re: issues found

Posted by Ted Dunning <te...@gmail.com>.
So far, I am pretty much not happy.  If it goes up now, it will take some
critical eyes to get it all clean.

On Mon, Aug 31, 2009 at 3:20 AM, Grant Ingersoll <gs...@apache.org>wrote:

> These all sound reasonable.  If you're happy w/ the changes, just put up a
> big patch on an issue, give it a few days to percolate and then commit.




-- 
Ted Dunning, CTO
DeepDyve

Re: issues found

Posted by Grant Ingersoll <gs...@apache.org>.

On Aug 31, 2009, at 2:55 AM, Ted Dunning wrote:

> I just did an exercise of implementing a faster sparse vector.  In the
> process, I uncovered a bunch of warts.  I will be filing some Jiras  
> and
> patches as soon as I can get to them, but here is a preview:
>
> a) most serialization code for vectors would write vectors with null  
> names
> out as if they had names of "".  This causes grief in tests and  
> seems wrong.
>
> b) the Vector/SparseVector hierarchy was oddly split out.  I added a
> HashVector and moved the current SparseVector into  
> IntDoubleMappingVector
> with SparseVector remaining as an abstract class.  This  
> unfortunately caused
> lots of upheaval even up into the Vector class.  I have yet to sort  
> this out
> cleanly.
>
> c) the squared distance functions were defined in multiple places.  I
> centralized these into SquaredEuclideanDistance as a static  
> function.  These
> definitions also were wrong and would ignore any components that had
> opposite sign and equal magnitude.  In fixing this, I went ahead and  
> wrote
> an implementation that makes use of any sparsity present.  To do  
> that, I
> added a method to Vector so that I could tell if a vector is  
> sparse.  This
> subsumes all of the distance optimizations that we have discussed.   
> It also
> makes it very easy for new code to use these optimizations without  
> knowing
> about them.
>
> d) the nonZeroIterator had to be substantially refactored to work in  
> an
> abstract class without knowing to much about the internals of  
> everything.
>
> e) in order ot make sure that all metrics worked reasonably with all  
> types,
> I have heavily refactored the testing structure for metrics.  I will  
> be
> doing more of this as well.  The goal is to test all vector types  
> against
> all distance metrics to make sure that they give the same results as
> DenseVector.  Then, I will build tests for DenseVector to verify  
> that it
> produces the correct result.  This is important because so many  
> vectors have
> special code to help metric computation.
>
> f) I have uncovered some strangeness in the ARFF I/O code that I  
> introduced
> with the SparseVector abstraction.  The old code will work as it  
> did, but it
> won't understand the new HashVector.
>
>
> Soo....
>
> I will be trying to break this down into as small a pieces as I can,  
> but the
> total will be a bunch of interdependent patches.  If anybody can  
> help me
> apply these as quickly as possible, we should have minimal problems  
> with it
> all.  If it drags out, it will get hard to keep rebasing the patch  
> sequence
> all the time.
>

These all sound reasonable.  If you're happy w/ the changes, just put  
up a big patch on an issue, give it a few days to percolate and then  
commit.




>
>
> -- 
> Ted Dunning, CTO
> DeepDyve

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:
http://www.lucidimagination.com/search