Posted to dev@mahout.apache.org by "Timothy Potter (JIRA)" <ji...@apache.org> on 2011/05/01 00:50:03 UTC

[jira] [Commented] (MAHOUT-639) Need special case to handle creating a new SequentialAccessSparseVector from a large (> 1M dims) random/hashed vector

    [ https://issues.apache.org/jira/browse/MAHOUT-639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13027399#comment-13027399 ] 

Timothy Potter commented on MAHOUT-639:
---------------------------------------

Hi Jake,

Agreed on the unit test timeout not being a good approach. Thanks for cleaning up the code I submitted. 

> Need special case to handle creating a new SequentialAccessSparseVector from a large (> 1M dims) random/hashed vector
> ---------------------------------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-639
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-639
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>    Affects Versions: 0.4, 0.5
>            Reporter: Timothy Potter
>            Assignee: Jake Mannix
>            Priority: Critical
>              Labels: matrix, scalability, svd, transpose
>             Fix For: 0.6
>
>         Attachments: MAHOUT-639.patch
>
>
> When trying to transpose a matrix of tfidf vectors created from text documents (ASF mail archives in this case), there is a bottleneck in the TransposeJob's reducer when Mahout creates a new SequentialAccessSparseVector from a RandomAccessSparseVector after the while loop completes:
>       SequentialAccessSparseVector outVector = new SequentialAccessSparseVector(tmp);
> For high-frequency terms (some of which occur over ~1M times in my data), the code to create a SequentialAccessSparseVector from a RandomAccessSparseVector bogs down completely.
> From Jake Mannix:
> "Suspicion confirmed:
>   public SequentialAccessSparseVector(Vector other) {
>     this(other.size(), other.getNumNondefaultElements());
>     Iterator<Element> it = other.iterateNonZero();
>     Element e;
>     while (it.hasNext() && (e = it.next()) != null) {
>       set(e.index(), e.get());
>     }
>   }
> we iterate over the other vector (which is in random/hashed order), adding it to the sequential access vector (which always tries to stay in sequential order).  So actually, this may be *worse* than O(n^2), but I'd prefer to just not know how much worse, and instead we should fix it.
> Should be fairly straightforward: make an array of structs (essentially) with the index and the double, of size other.getNumNonDefaultElements() (what a horrible method name), fill it up on one iteration over the other vector, sort it in place, then make your new OrderedIntDoubleMapping out of the indexes and values (unless someone has a cleverer idea to sort a pair of two arrays at the same time, shuffling one based on the ordering criterion of the other)."
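
The fix Jake describes above (collect index/value pairs in one pass, sort them by index, then build the ordered mapping from the sorted arrays) can be sketched as follows. This is only a hypothetical illustration, not the actual MAHOUT-639 patch; the `IndexValue` class and `sortByIndex` method are invented names, and the real code would feed the sorted arrays into an OrderedIntDoubleMapping:

```java
import java.util.Arrays;

// Sketch of the proposed O(n log n) approach: one pass to gather
// (index, value) pairs, one sort, then a sequential build -- instead of
// repeated ordered insertion into the sequential vector, which degrades
// badly for ~1M-element vectors.
public class SortedPairsSketch {

    // Minimal stand-in for a sparse-vector element ("array of structs").
    static final class IndexValue implements Comparable<IndexValue> {
        final int index;
        final double value;

        IndexValue(int index, double value) {
            this.index = index;
            this.value = value;
        }

        @Override
        public int compareTo(IndexValue o) {
            return Integer.compare(index, o.index);
        }
    }

    // Given parallel arrays in hashed (arbitrary) order, return the
    // pairs sorted by index; this shuffles the values alongside the
    // indexes, answering the "sort a pair of two arrays" question.
    static IndexValue[] sortByIndex(int[] indices, double[] values) {
        IndexValue[] pairs = new IndexValue[indices.length];
        for (int i = 0; i < pairs.length; i++) {
            pairs[i] = new IndexValue(indices[i], values[i]);
        }
        Arrays.sort(pairs); // single O(n log n) sort
        return pairs;
    }

    public static void main(String[] args) {
        // Hashed-order input, as iterateNonZero() on a
        // RandomAccessSparseVector would produce it.
        int[] idx = {7, 2, 9, 1};
        double[] val = {0.7, 0.2, 0.9, 0.1};
        IndexValue[] sorted = sortByIndex(idx, val);
        StringBuilder sb = new StringBuilder();
        for (IndexValue p : sorted) {
            sb.append(p.index).append(':').append(p.value).append(' ');
        }
        System.out.println(sb.toString().trim());
    }
}
```

An alternative, if object allocation per element is a concern at this scale, would be sorting an index permutation and then filling both output arrays through it; the asymptotics are the same.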

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira