Posted to dev@mahout.apache.org by "Sebastian Schelter (JIRA)" <ji...@apache.org> on 2011/02/03 22:30:30 UTC

[jira] Commented: (MAHOUT-577) RowSimilarityJob hangs during CooccurrencesMapper

    [ https://issues.apache.org/jira/browse/MAHOUT-577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12990318#comment-12990318 ] 

Sebastian Schelter commented on MAHOUT-577:
-------------------------------------------

RowSimilarityJob has the nice feature that it only computes similarities for rows that have at least one element in common (i.e., there is at least one column in which both rows have an entry). It tries to avoid comparing every row with every other row, so I'd say it is intended to work on sparse matrices only. On dense matrices it will be slower than the naive approach of comparing each row with every other row, so it should not be used as described in this issue.
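
To give a rough feel for why density matters, here is a small back-of-the-envelope sketch. It is not Mahout code: the class and method names are made up for illustration, and it assumes that a cooccurrence-based approach produces one row pair per column for every two rows that both have an entry in that column. Under that assumption a column with n entries contributes n * (n - 1) / 2 pairs, while the naive approach looks at each pair of rows only once.

```java
public class CooccurrencePairCount {

    // Unordered row pairs sharing a single column that has the given
    // number of non-zero entries: n * (n - 1) / 2.
    static long pairsForColumn(long nonZeroEntries) {
        return nonZeroEntries * (nonZeroEntries - 1) / 2;
    }

    // Sum of the per-column pair counts; columnCounts[c] is the number
    // of non-zero entries in column c (hypothetical input).
    static long totalPairs(long[] columnCounts) {
        long total = 0;
        for (long n : columnCounts) {
            total += pairsForColumn(n);
        }
        return total;
    }

    public static void main(String[] args) {
        long rows = 146682L;    // matrix dimensions taken from the issue
        long columns = 138351L;

        // Assumed sparse case: every column has only 10 entries.
        long sparsePairs = columns * pairsForColumn(10);
        // Fully dense case: every column has an entry in every row.
        long densePairs = columns * pairsForColumn(rows);
        // Naive approach: each pair of rows is compared exactly once.
        long naivePairs = pairsForColumn(rows);

        System.out.println("cooccurrence pairs, sparse: " + sparsePairs);
        System.out.println("cooccurrence pairs, dense:  " + densePairs);
        System.out.println("naive row comparisons:      " + naivePairs);
    }
}
```

In the fully dense case the cooccurrence count is the naive count multiplied by the number of columns, which fits the endless stream of spills in the mapper log quoted below.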

I agree with you that a lot of small tweaks might be applicable and that intelligent sampling techniques could help a lot, depending on the use case.

> RowSimilarityJob hangs during CooccurrencesMapper
> -------------------------------------------------
>
>                 Key: MAHOUT-577
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-577
>             Project: Mahout
>          Issue Type: Bug
>          Components: Collaborative Filtering
>    Affects Versions: 0.4
>         Environment: Linux Debian 5.0.5, 12GB Ram, Hadoop 20.3 installation 
>            Reporter: Maya Hristakeva
>             Fix For: 0.5
>
>
> Hello,
> When trying to run a RowSimilarityJob on a matrix ( 146682 x 138351 ), the job gets through the RowWeightMapper and WeightedOccurrencesPerColumnReducer, and hangs during the CooccurrencesMapper although it shows that the map tasks are 100% complete. 
> The command I use to run the job is: 
> hadoop jar mahout-core-0.4-job.jar org.apache.mahout.math.hadoop.similarity.RowSimilarityJob -Dmapred.input.dir=/user/maya.hristakeva/mahout/core4/tf/1/0.001/title/12_07_10/lda/5/lda-sim/ldaCompressedDocumentsMatrix -Dmapred.output.dir=/user/maya.hristakeva/mahout/core4/tf/1/0.001/title/12_07_10/lda/5/lda-sim/ldaDocumentSimilarityMatrix -Dmapred.reduce.tasks=8 -Dmapred.map.tasks=200 -Dmapred.job.name=LDA_ROW_SIMILARITY_TEST --tempDir /user/maya.hristakeva/temp/lda/5 --numberOfColumns 138351 --similarityClassname org.apache.mahout.math.hadoop.similarity.vector.DistributedEuclideanDistanceVectorSimilarity --maxSimilaritiesPerRow 10
> The output of the mappers, which show as 100% complete but are hanging, is: 
> syslog logs
> 2011-01-05 18:30:00,835 INFO org.apache.hadoop.mapred.MapTask: bufstart = 29085149; bufend = 39038598; bufvoid = 99614720
> 2011-01-05 18:30:00,835 INFO org.apache.hadoop.mapred.MapTask: kvstart = 65461; kvend = 327605; length = 327680
> 2011-01-05 18:30:06,241 INFO org.apache.hadoop.mapred.MapTask: Finished spill 94
> 2011-01-05 18:30:09,208 INFO org.apache.hadoop.mapred.MapTask: Spilling map output: record full = true
> 2011-01-05 18:30:09,208 INFO org.apache.hadoop.mapred.MapTask: bufstart = 39038598; bufend = 48983989; bufvoid = 99614720
> 2011-01-05 18:30:09,208 INFO org.apache.hadoop.mapred.MapTask: kvstart = 327605; kvend = 262068; length = 327680
> 2011-01-05 18:30:14,528 INFO org.apache.hadoop.mapred.MapTask: Finished spill 95
> 2011-01-05 18:30:17,328 INFO org.apache.hadoop.mapred.MapTask: Spilling map output: record full = true
> 2011-01-05 18:30:17,328 INFO org.apache.hadoop.mapred.MapTask: bufstart = 48983989; bufend = 58929384; bufvoid = 99614720
> 2011-01-05 18:30:17,328 INFO org.apache.hadoop.mapred.MapTask: kvstart = 262068; kvend = 196531; length = 327680
> 2011-01-05 18:30:22,615 INFO org.apache.hadoop.mapred.MapTask: Finished spill 96
> .
> .
> .
> This problem does not occur when I use a toy matrix of 100 x 100, but once I give it the original matrix of ..... the problem is always reproducible. 
> Any ideas on what could be causing this? 
> Thanks, 
> Maya Hristakeva
