Posted to dev@mahout.apache.org by "Sean Owen (JIRA)" <ji...@apache.org> on 2010/09/23 14:22:36 UTC

[jira] Commented: (MAHOUT-308) Improve Lanczos to handle extremely large feature sets (without hashing)

    [ https://issues.apache.org/jira/browse/MAHOUT-308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12914000#action_12914000 ] 

Sean Owen commented on MAHOUT-308:
----------------------------------

Is this patch still fresh and committable? It looks like it's still pending review by Jake. I know this part has been in flux. Worth punting to 0.5?

> Improve Lanczos to handle extremely large feature sets (without hashing)
> ------------------------------------------------------------------------
>
>                 Key: MAHOUT-308
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-308
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>    Affects Versions: 0.3
>         Environment: all
>            Reporter: Jake Mannix
>            Assignee: Jake Mannix
>             Fix For: 0.4
>
>         Attachments: MAHOUT-308.patch
>
>
> DistributedLanczosSolver currently keeps all Lanczos vectors in memory on the driver (client) machine while the Hadoop iterations run.  The memory requirement for this is (desiredRank) * (numColumnsOfInput) * 8 bytes, so with desiredRank in the few hundreds, usefulness caps out at a few million columns on most commodity hardware.
> The solution (short of doing stochastic decomposition) is to persist the Lanczos basis to disk, keeping only the two most recent vectors in memory.  Some care must be taken in the orthogonalizeAgainstBasis() method call, which uses the entire basis; that step would be slower, since the stored vectors have to be read back from disk (a rough sketch of the idea follows below the quoted description).
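
To make the memory math and the disk-backed orthogonalization concrete, here is a minimal, self-contained Java sketch of the scheme described above. This is NOT the MAHOUT-308 patch: the class and method names (OnDiskBasisSketch, appendBasisVector, orthogonalizeAgainstStoredBasis) are illustrative only, and plain double[] arrays with local files stand in for Mahout Vectors persisted as SequenceFiles on HDFS.

import java.io.*;

/**
 * Illustrative sketch only (not DistributedLanczosSolver code): keep just the
 * two most recent Lanczos vectors in RAM, append earlier basis vectors to disk,
 * and stream them back one at a time when re-orthogonalizing a new vector.
 */
public class OnDiskBasisSketch {

  // Memory needed to hold the whole basis in RAM: desiredRank * numCols * 8 bytes.
  // E.g. desiredRank = 300 and numCols = 10,000,000 gives roughly 24 GB on the driver.
  static long inMemoryBasisBytes(int desiredRank, long numCols) {
    return (long) desiredRank * numCols * 8L;
  }

  // Append one basis vector to the on-disk basis file.
  static void appendBasisVector(File basisFile, double[] v) throws IOException {
    try (DataOutputStream out = new DataOutputStream(
        new BufferedOutputStream(new FileOutputStream(basisFile, true)))) {
      for (double x : v) {
        out.writeDouble(x);
      }
    }
  }

  // Gram-Schmidt pass against the persisted basis: subtract the projection of
  // 'candidate' onto each stored vector, reading vectors one at a time so only
  // a single extra vector is resident in memory.  Assumes the stored vectors
  // are already unit-norm, as Lanczos basis vectors are.
  static void orthogonalizeAgainstStoredBasis(File basisFile, double[] candidate)
      throws IOException {
    int n = candidate.length;
    double[] basisVector = new double[n];
    long storedVectors = basisFile.length() / (8L * n);
    try (DataInputStream in = new DataInputStream(
        new BufferedInputStream(new FileInputStream(basisFile)))) {
      for (long k = 0; k < storedVectors; k++) {
        for (int i = 0; i < n; i++) {
          basisVector[i] = in.readDouble();
        }
        double dot = 0.0;
        for (int i = 0; i < n; i++) {
          dot += candidate[i] * basisVector[i];
        }
        for (int i = 0; i < n; i++) {
          candidate[i] -= dot * basisVector[i];  // remove the component along basisVector
        }
      }
    }
  }
}

Streaming the basis this way trades the O(desiredRank * numColumns) driver memory for one extra pass over the stored vectors per iteration, which is the slowdown the description alludes to.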

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.