Posted to dev@mahout.apache.org by "lariven (JIRA)" <ji...@apache.org> on 2015/06/13 12:45:00 UTC

[jira] [Comment Edited] (MAHOUT-1739) maxSimilarItemsPerItem param of ItemSimilarityJob doesn't behave correct

    [ https://issues.apache.org/jira/browse/MAHOUT-1739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14584549#comment-14584549 ] 

lariven edited comment on MAHOUT-1739 at 6/13/15 10:44 AM:
-----------------------------------------------------------

The existing unit test in the project can be used to reproduce this:
mvn test -Dtest=org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJobTest

How to reproduce the bug:
 Step 1: at line 210, add two records to the test data:

    writeLines(inputFile,
        "1,1,1",
        "1,4,1", // added
        "2,4,1", // added
        "1,3,1",
        "2,2,1",
        "2,3,1",
        "3,1,1",
        "3,2,1",
        "4,1,1",
        "4,2,1",
        "4,3,1",
        "5,2,1",
        "6,1,1",
        "6,2,1");

 Step 2: on line 231, change maxSimilaritiesPerItem from 1 to 2:
        TanimotoCoefficientSimilarity.class.getName(), "--maxSimilaritiesPerItem", "2" });

Expected output:
1       2       0.5
1       3       0.4
2       1       0.5
2       3       0.3333333333333333
3       1       0.4
3       4       0.6666666666666666
4       1       0.2
4       3       0.6666666666666666
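
The expected similarities can be checked by hand from the modified test data. Below is a minimal sketch, assuming the Tanimoto coefficient is the number of users two items share divided by the number of users who have either item. The class and method names are hypothetical and nothing here calls Mahout. The expected output lists only the two best matches per item, so the weakest pair (items 2 and 4) does not appear in it.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Stand-alone check of the expected similarities; names are hypothetical, not Mahout code.
class TanimotoCheck {

  // {userID, itemID} pairs from the modified test input (the preference value is always 1).
  static final long[][] PREFS = {
      {1, 1}, {1, 4}, {2, 4}, {1, 3}, {2, 2}, {2, 3}, {3, 1},
      {3, 2}, {4, 1}, {4, 2}, {4, 3}, {5, 2}, {6, 1}, {6, 2}
  };

  // Collects, for each item, the set of users who have a preference for it.
  static Map<Long, Set<Long>> usersByItem() {
    Map<Long, Set<Long>> result = new TreeMap<>();
    for (long[] p : PREFS) {
      result.computeIfAbsent(p[1], k -> new HashSet<>()).add(p[0]);
    }
    return result;
  }

  // Assumed definition: size of the intersection divided by size of the union.
  static double tanimoto(long itemA, long itemB) {
    Map<Long, Set<Long>> users = usersByItem();
    Set<Long> a = users.get(itemA);
    Set<Long> b = users.get(itemB);
    Set<Long> common = new HashSet<>(a);
    common.retainAll(b);
    int union = a.size() + b.size() - common.size();
    return (double) common.size() / union;
  }

  public static void main(String[] args) {
    // Print every unordered pair of the four items with its similarity.
    for (long a = 1; a <= 4; a++) {
      for (long b = a + 1; b <= 4; b++) {
        System.out.println(a + "\t" + b + "\t" + tanimoto(a, b));
      }
    }
  }
}
```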


Actual output:
1       2       0.5
1       3       0.4
1       4       0.2
2       3       0.3333333333333333
3       4       0.6666666666666666


Why:

The odd swap of itemID with otherItemID. Because each pair is written with the smaller ID first, a target item may lose some of its similar items, while those same pairs are appended to other target items.
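
The effect can be modelled outside Mahout. Below is a minimal sketch, assuming each target item has already been cut down to its top 2 similar items (the expected output above) and that rows written under the same key collapse into one. SwapSketch and its methods are hypothetical names, not part of ItemSimilarityJob.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Stand-alone model of the emit step; names are hypothetical, not Mahout code.
class SwapSketch {

  // Top-2 similar items per target item, copied from the expected output:
  // {itemID, otherItemID, similarity}.
  static final double[][] TOP_K = {
      {1, 2, 0.5}, {1, 3, 0.4},
      {2, 1, 0.5}, {2, 3, 0.3333333333333333},
      {3, 1, 0.4}, {3, 4, 0.6666666666666666},
      {4, 1, 0.2}, {4, 3, 0.6666666666666666}
  };

  // Builds the output rows. With swap == true the smaller ID is always written first,
  // as in the if/else from the issue description. Identical rows collapse into one
  // (assumption: this stands in for the job merging values that share a key).
  static List<String> emit(boolean swap) {
    TreeSet<String> rows = new TreeSet<>();
    for (double[] entry : TOP_K) {
      long itemID = (long) entry[0];
      long otherItemID = (long) entry[1];
      if (swap && itemID > otherItemID) {
        long tmp = itemID;
        itemID = otherItemID;
        otherItemID = tmp;
      }
      rows.add(itemID + "\t" + otherItemID + "\t" + entry[2]);
    }
    return new ArrayList<>(rows);
  }

  // Counts the rows whose first column is the given item.
  static int rowsFor(List<String> rows, long itemID) {
    int count = 0;
    for (String row : rows) {
      if (row.startsWith(itemID + "\t")) {
        count++;
      }
    }
    return count;
  }

  public static void main(String[] args) {
    // Print the rows produced when the smaller ID is forced into the first column.
    for (String row : emit(true)) {
      System.out.println(row);
    }
  }
}
```

Under these assumptions the swapped variant yields five rows: three for item 1, one each for items 2 and 3, and none for item 4, which is the same set of pairs as the actual output reported above. With emit(false) every item keeps exactly two rows.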



> maxSimilarItemsPerItem param of ItemSimilarityJob doesn't behave correct
> ------------------------------------------------------------------------
>
>                 Key: MAHOUT-1739
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1739
>             Project: Mahout
>          Issue Type: Bug
>          Components: Collaborative Filtering
>    Affects Versions: 0.10.0
>            Reporter: lariven
>              Labels: easyfix, patch
>         Attachments: fix_maxSimilarItemsPerItem_incorrect_behave.patch
>
>
> The similar items that ItemSimilarityJob outputs for each target item may exceed the number set by the maxSimilarItemsPerItem parameter. The following code in ItemSimilarityJob.java, around line 200, may be the cause:
>         if (itemID < otherItemID) {
>           ctx.write(new EntityEntityWritable(itemID, otherItemID), new DoubleWritable(similarItem.getSimilarity()));
>         } else {
>           ctx.write(new EntityEntityWritable(otherItemID, itemID), new DoubleWritable(similarItem.getSimilarity()));
>         }
> I don't know why itemID needs to be switched with otherItemID; I think a single line is enough:
>           ctx.write(new EntityEntityWritable(itemID, otherItemID), new DoubleWritable(similarItem.getSimilarity()));


