Posted to dev@mahout.apache.org by "Timothy Potter (JIRA)" <ji...@apache.org> on 2012/06/15 01:12:42 UTC

[jira] [Updated] (MAHOUT-1019) VectorDistanceSimilarityJob

     [ https://issues.apache.org/jira/browse/MAHOUT-1019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Timothy Potter updated MAHOUT-1019:
-----------------------------------

    Attachment: MAHOUT-1019.patch

OK, so here's a solution to this problem. It made a huge difference to the size of the output in our environment: with this patch, the output dropped to a few GB, as opposed to 4+ TB!

I only applied this to the "pw" style output, as it probably isn't useful for the "v" style output. Also, the default value is Double.MAX_VALUE rather than the 1.0d proposed in the description, since a default of 1.0 would imply you were using cosine distance.
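
For anyone who wants the idea without reading the patch, here is a rough, standalone sketch of the filtering. It is not the patch itself and does not use the Mahout or Hadoop APIs; the class name, method names, and output format are made up for illustration, Euclidean distance is assumed as the measure, and I am assuming a pair is kept when its distance is less than or equal to the threshold.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: a plain-Java stand-in for the pairwise ("pw") output
// with a maximum-distance filter. All names here are hypothetical.
public class ThresholdFilterSketch {

    // Default threshold: no finite distance exceeds it, so nothing is filtered
    // and the old behavior is preserved regardless of the distance measure.
    static final double NO_FILTER = Double.MAX_VALUE;

    // Euclidean distance between two vectors of equal length (assumed measure).
    static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Emits one "seedIndex,vectorIndex<TAB>distance" record per pair whose
    // distance does not exceed maxDistance; all other pairs are dropped.
    static List<String> pairwise(double[][] seeds, double[][] vectors, double maxDistance) {
        List<String> out = new ArrayList<>();
        for (int s = 0; s < seeds.length; s++) {
            for (int v = 0; v < vectors.length; v++) {
                double dist = euclidean(seeds[s], vectors[v]);
                if (dist <= maxDistance) {
                    out.add(s + "," + v + "\t" + dist);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        double[][] seeds = {{0.0, 0.0}};
        double[][] vectors = {{3.0, 4.0}, {0.0, 1.0}, {30.0, 40.0}};
        // Distances from the seed are 5.0, 1.0 and 50.0.
        System.out.println(pairwise(seeds, vectors, NO_FILTER).size()); // unfiltered
        System.out.println(pairwise(seeds, vectors, 10.0).size());      // far pair dropped
    }
}
```

The sketch also shows why 1.0 is a poor default: with a measure like Euclidean distance, a threshold of 1.0 would silently drop most pairs, whereas Double.MAX_VALUE keeps everything until the user opts in.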
                
> VectorDistanceSimilarityJob
> ---------------------------
>
>                 Key: MAHOUT-1019
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1019
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>         Environment: all
>            Reporter: Timothy Potter
>            Priority: Minor
>              Labels: VectorDistanceSimilarityJob, distance, vector
>         Attachments: MAHOUT-1019.patch
>
>   Original Estimate: 12h
>  Remaining Estimate: 12h
>
> The VectorDistanceSimilarityJob is a fantastic tool, but it risks creating terabytes of output of dubious value. For example, I have ~10K seed vectors and millions of vectors to compute the similarity against, so I would like to add an optional parameter to this job that specifies a maximum distance threshold and prevents any distance above this value from being written to the output. The default would be 1.0d so that no filtering is applied, which ensures backwards compatibility; if supplied, only rows where the distance is less than the threshold would be output from the mapper. This can reduce the storage requirements of the output immensely. The parameter could be named something like: noOutputIfDistanceGreaterThan
