Posted to dev@mahout.apache.org by "Suneel Marthi (JIRA)" <ji...@apache.org> on 2013/07/25 14:27:48 UTC

[jira] [Updated] (MAHOUT-1019) VectorDistanceSimilarityJob

     [ https://issues.apache.org/jira/browse/MAHOUT-1019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Suneel Marthi updated MAHOUT-1019:
----------------------------------

    Fix Version/s: 0.8
    
> VectorDistanceSimilarityJob
> ---------------------------
>
>                 Key: MAHOUT-1019
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1019
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>    Affects Versions: 0.8
>         Environment: all
>            Reporter: Timothy Potter
>            Priority: Minor
>              Labels: VectorDistanceSimilarityJob, distance, vector
>             Fix For: 0.8
>
>         Attachments: MAHOUT-1019.patch
>
>   Original Estimate: 12h
>  Remaining Estimate: 12h
>
> The VectorDistanceSimilarityJob is a fantastic tool, but it risks creating terabytes of output of dubious value. For example, I have ~10K seed vectors and millions of vectors to compare against them. I would like to add an optional parameter to this job that specifies a maximum distance threshold, so that any distance above this value is not written to the output. The default would be 1.0d, so no filtering is applied, which ensures backwards compatibility; if a value is supplied, only rows whose distance is less than the threshold would be output from the mapper. This can reduce the storage requirements of the output immensely. The parameter could be named something like: noOutputIfDistanceGreaterThan
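
A minimal sketch of the proposed filter follows. It is not the attached patch and does not use the Mahout or Hadoop APIs. The class name, the method names, the choice of Euclidean distance, and the decision to keep a distance exactly equal to the threshold are all assumptions made for illustration. It only shows the idea of dropping seed/vector pairs whose distance exceeds a maximum before anything is emitted.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the mapper-side check; not Mahout code.
class DistanceThresholdSketch {

    // Euclidean distance between two dense vectors of equal length.
    static double euclidean(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("dimension mismatch");
        }
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Returns the indices of the seeds whose distance to the vector does not
    // exceed maxDistance. Pairs above the threshold are skipped, which is
    // where the output savings would come from. Keeping the boundary case
    // (distance == maxDistance) is an assumption of this sketch.
    static List<Integer> seedsWithinThreshold(double[][] seeds, double[] vector,
                                              double maxDistance) {
        List<Integer> kept = new ArrayList<>();
        for (int s = 0; s < seeds.length; s++) {
            if (euclidean(seeds[s], vector) <= maxDistance) {
                kept.add(s);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        double[][] seeds = { {0.0, 0.0}, {3.0, 4.0}, {10.0, 0.0} };
        double[] vector = {0.0, 0.0};
        // Distances to the three seeds are 0, 5 and 10.
        System.out.println(seedsWithinThreshold(seeds, vector, 5.0)); // prints [0, 1]
    }
}
```

In this sketch, passing Double.POSITIVE_INFINITY as the threshold keeps every pair, which corresponds to the "no filtering" behaviour the description wants as the backwards-compatible default.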

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira