Posted to dev@mahout.apache.org by "Suneel Marthi (Commented) (JIRA)" <ji...@apache.org> on 2012/01/30 05:57:10 UTC

[jira] [Commented] (MAHOUT-834) rowsimilarityjob doesn't clean it's temp dir, and fails when seeing it again

    [ https://issues.apache.org/jira/browse/MAHOUT-834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13195934#comment-13195934 ] 

Suneel Marthi commented on MAHOUT-834:
--------------------------------------

1. What was the outcome of this thread? I am having the same issue and had opened a Jira ticket, MAHOUT-964, well before I saw this thread. I agree that RowSimilarityJob needs an overwrite option to clean up the output and temp folders from a previous run.
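
To make the idea concrete, here is a rough sketch of what an overwrite-style cleanup could look like. It is only an illustration: it uses plain java.nio on the local filesystem rather than Hadoop's FileSystem API, and the names TempDirCleanup, deleteRecursively and prepare are made up for this example and are not Mahout code.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not part of Mahout: removes leftovers from a previous run.
final class TempDirCleanup {

  // Recursively deletes dir if it exists; returns true if something was removed.
  static boolean deleteRecursively(Path dir) throws IOException {
    if (!Files.exists(dir)) {
      return false;
    }
    List<Path> paths;
    try (Stream<Path> walk = Files.walk(dir)) {
      // Reverse order so that children come before their parent directory.
      paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
    }
    for (Path p : paths) {
      Files.delete(p);
    }
    return true;
  }

  // Cleans the output and temp directories only when overwrite is requested.
  static void prepare(Path output, Path temp, boolean overwrite) throws IOException {
    if (overwrite) {
      deleteRecursively(output);
      deleteRecursively(temp);
    }
  }

  public static void main(String[] args) throws IOException {
    // Simulate a stale temp/weights directory left behind by an earlier run.
    Path temp = Files.createTempDirectory("temp");
    Files.createDirectories(temp.resolve("weights"));
    Path output = Files.createTempDirectory("sims");
    prepare(output, temp, true);
    System.out.println("temp still exists: " + Files.exists(temp));
  }
}
```

In the real job the same step would presumably go through Hadoop's filesystem layer (the issue description suggests HadoopUtil.delete), run before any of the intermediate outputs are written.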

2. Another concern: if the similarity measure specified is not a valid one, for example:

mahout rowsimilarity --input matrixified/matrix --output sims_foo/ --numberOfColumns 27684 --similarityClassname SIMILARITY_COS --excludeSelfSimilarity

then RowSimilarityJob should exit immediately instead of going on to run the Normalizer, CooccurrencesMapper and UnsymmetrifyMapper.
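
A fail-fast check along these lines is what I have in mind. This is a sketch under assumptions: the enum below is a hypothetical stand-in that lists only the two measure names mentioned in this thread, the class name SimilarityNameCheck is invented, and it ignores any other form of value the option may accept.

```java
import java.util.Arrays;

// Hypothetical validation sketch, not Mahout code.
final class SimilarityNameCheck {

  // Stand-in list of measures; only the names seen in this thread are included.
  enum KnownSimilarity {
    SIMILARITY_COSINE,
    SIMILARITY_LOGLIKELIHOOD
  }

  // Throws IllegalArgumentException right away if name is not a known measure.
  static KnownSimilarity parse(String name) {
    try {
      return KnownSimilarity.valueOf(name);
    } catch (IllegalArgumentException e) {
      throw new IllegalArgumentException(
          "Unknown similarity '" + name + "', expected one of "
              + Arrays.toString(KnownSimilarity.values()), e);
    }
  }

  public static void main(String[] args) {
    System.out.println(parse("SIMILARITY_COSINE"));
    try {
      parse("SIMILARITY_COS"); // the invalid name from the command above
    } catch (IllegalArgumentException e) {
      System.out.println(e.getMessage());
    }
  }
}
```

Doing this check while the arguments are parsed means a bad name is reported before any MapReduce phase is started.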

3. The 'excludeSelfSimilarity' option has to be given an explicit value of 'true' or 'false'; otherwise it always defaults to 'false', as in the following:

mahout rowsimilarity --input matrixified/matrix --output sims_foo/ --numberOfColumns 27684 --similarityClassname SIMILARITY_COSINE --excludeSelfSimilarity

This is inconsistent with the way the --overwrite option works. Merely specifying --excludeSelfSimilarity on the command line does not set it to 'true'.
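
The behaviour I would expect is roughly the following. This is a hand-rolled sketch, not Mahout's option handling; the FlagSketch class and its flag method are hypothetical and exist only to show the intended semantics.

```java
import java.util.Arrays;

// Hypothetical parser sketch showing flag semantics, not Mahout code.
final class FlagSketch {

  // Absent flag -> false. Flag with no value after it -> true.
  // Flag followed by a value -> true only if that value is "true" (case-insensitive).
  static boolean flag(String[] args, String name) {
    int i = Arrays.asList(args).indexOf(name);
    if (i < 0) {
      return false;
    }
    boolean hasValue = i + 1 < args.length && !args[i + 1].startsWith("--");
    return hasValue ? Boolean.parseBoolean(args[i + 1]) : true;
  }

  public static void main(String[] args) {
    String[] bare = {"--input", "matrixified/matrix", "--excludeSelfSimilarity"};
    String[] explicit = {"--excludeSelfSimilarity", "false", "--input", "matrixified/matrix"};
    System.out.println(flag(bare, "--excludeSelfSimilarity"));
    System.out.println(flag(explicit, "--excludeSelfSimilarity"));
  }
}
```

With these semantics a bare --excludeSelfSimilarity and a bare --overwrite would behave the same way, while an explicit 'true' or 'false' would still be honoured.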




                
> rowsimilarityjob doesn't clean it's temp dir, and fails when seeing it again
> ----------------------------------------------------------------------------
>
>                 Key: MAHOUT-834
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-834
>             Project: Mahout
>          Issue Type: Bug
>          Components: Integration
>            Reporter: Dan Brickley
>            Priority: Minor
>
> If I do this:
> mahout rowsimilarity --input matrixified/matrix --output sims/ --numberOfColumns 27684 --similarityClassname SIMILARITY_LOGLIKELIHOOD --excludeSelfSimilarity
> then clean my output and rerun,
> rm -rf sims/ # (though this step doesn't even seem needed)
> then try again:
> mahout rowsimilarity --input matrixified/matrix --output sims/ --numberOfColumns 27684 --similarityClassname SIMILARITY_LOGLIKELIHOOD --excludeSelfSimilarity
> The temp files left from the first run make a re-run impossible - we get: "Exception in thread "main" org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory temp/weights already exists".
> Manually deleting the temp directory fixes this.
> I get the same behaviour if I explicitly pass in a --tempDir path, e.g.:
> mahout rowsimilarity --input matrixified/matrix --output sims/ --numberOfColumns 27684 --similarityClassname SIMILARITY_LOGLIKELIHOOD --excludeSelfSimilarity --tempDir tmp2/
> Presumably something like HadoopUtil.delete(getConf(),tempDirPath) is needed somewhere?  (and maybe --overwrite too ?)
