Posted to dev@mahout.apache.org by "Joris Geessels (JIRA)" <ji...@apache.org> on 2011/02/14 20:19:57 UTC

[jira] Created: (MAHOUT-610) Not all Coocurrences provided to SimilarityReducer

Not all Coocurrences provided to SimilarityReducer
--------------------------------------------------

                 Key: MAHOUT-610
                 URL: https://issues.apache.org/jira/browse/MAHOUT-610
             Project: Mahout
          Issue Type: Bug
          Components: Collaborative Filtering, Math
            Reporter: Joris Geessels
            Assignee: Sean Owen


While doing some tests with the RecommenderJob, and more specifically the RowSimilarityJob, I noticed that in some cases not all cooccurrences are used in the similarity calculations (done in the SimilarityReducer class).
A RowPair object with (RowA=1,RowB=2) isn't considered the same as (RowA=2,RowB=1). This causes problems because CooccurrencesMapper sometimes emits row pairs in the first form and sometimes in the second, thus separating the cooccurrences. If I'm right, this is because the ordering of the WeightedCoocurrenceArray for one column isn't guaranteed to be the same as for another column.
The solution is very simple: either change the compare method of the RowPair class, or adapt the CooccurrencesMapper to enforce that RowA < RowB.
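
To make the second option concrete, here is a minimal sketch of a key that always stores the smaller row first. OrderedRowPair and its factory method of() are hypothetical names for illustration, not Mahout's actual classes, and a real mapper would presumably also have to swap any per-row weights along with the rows.

```java
import java.util.Objects;

/** Hypothetical stand-in for a row-pair key; not Mahout's actual class. */
final class OrderedRowPair implements Comparable<OrderedRowPair> {
  private final int rowA;
  private final int rowB;

  private OrderedRowPair(int rowA, int rowB) {
    this.rowA = rowA;
    this.rowB = rowB;
  }

  /** Builds a key with the smaller row first, so (2,1) and (1,2) coincide. */
  static OrderedRowPair of(int first, int second) {
    return first <= second
        ? new OrderedRowPair(first, second)
        : new OrderedRowPair(second, first);
  }

  int getRowA() { return rowA; }

  int getRowB() { return rowB; }

  /** Orders by rowA, then rowB; equal keys compare as 0 and group together. */
  @Override
  public int compareTo(OrderedRowPair other) {
    int byA = Integer.compare(rowA, other.rowA);
    return byA != 0 ? byA : Integer.compare(rowB, other.rowB);
  }

  @Override
  public boolean equals(Object o) {
    if (!(o instanceof OrderedRowPair)) {
      return false;
    }
    OrderedRowPair other = (OrderedRowPair) o;
    return rowA == other.rowA && rowB == other.rowB;
  }

  @Override
  public int hashCode() {
    return Objects.hash(rowA, rowB);
  }

  @Override
  public String toString() {
    return "(" + rowA + "," + rowB + ")";
  }

  public static void main(String[] args) {
    // Both orientations of the pair end up as the same key.
    System.out.println(OrderedRowPair.of(2, 1).equals(OrderedRowPair.of(1, 2))); // prints true
  }
}
```

Normalizing at construction time leaves the comparison logic untouched, which is roughly the trade-off between the two options above.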

Hope I've not missed something obvious, and that this is intended behavior. If this is the case, please enlighten me :-)

Also, slightly off topic: while doing these tests, I noticed that the predictions are all remarkably high and that the RMSE on the movielens 100k dataset lies around 1.6.
A bit too high if you ask me. Are these normal values or am I doing something wrong?

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Resolved] (MAHOUT-610) Not all Coocurrences provided to SimilarityReducer

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen resolved MAHOUT-610.
------------------------------

       Resolution: Fixed
    Fix Version/s: 0.5
         Assignee: Sebastian Schelter  (was: Sean Owen)

Looks like Sebastian fixed this

[jira] Commented: (MAHOUT-610) Not all Coocurrences provided to SimilarityReducer

Posted by "Sebastian Schelter (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12995478#comment-12995478 ] 

Sebastian Schelter commented on MAHOUT-610:
-------------------------------------------

The fix is committed; I also added some counters.

[jira] Updated: (MAHOUT-610) Not all Coocurrences provided to SimilarityReducer

Posted by "Joris Geessels (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joris Geessels updated MAHOUT-610:
----------------------------------

    Attachment: mahout-610.patch

The attached patch solves the issue, imo. I've chosen not to modify the WeightedRowPair class, as I can imagine that in some cases the behavior of the currently implemented compareTo method is desired.

[jira] [Commented] (MAHOUT-610) Not all Coocurrences provided to SimilarityReducer

Posted by "Sebastian Schelter (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13009635#comment-13009635 ] 

Sebastian Schelter commented on MAHOUT-610:
-------------------------------------------

Yes, it's fixed.

[jira] Commented: (MAHOUT-610) Not all Coocurrences provided to SimilarityReducer

Posted by "Sebastian Schelter (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12994463#comment-12994463 ] 

Sebastian Schelter commented on MAHOUT-610:
-------------------------------------------

Hi Joris,

I looked through the code and I think you're right and you found a serious bug. Thank you very much for this. 

I'm only wondering why all the tests that use RowSimilarityJob internally always worked. My guess is that it's because CooccurrencesMapper implicitly relies on the weightedOccurrences in its weightedOccurrenceArray being sorted by row ascending. If we only run the M/R code as a local Hadoop job, this might be the case because it's the natural order in which the weightedOccurrences are mapped out. As you already said, there's no ordering guarantee in a truly distributed environment. I will run some verification tests for this later this week and then I'll commit your fix.
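
The effect of that implicit ordering assumption can be illustrated with a small standalone simulation. This is only a sketch under simplifying assumptions: PairGroupingDemo, emitPairs and group are hypothetical names, keys are plain strings instead of Mahout's writable classes, and weights are left out entirely.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical simulation of pair emission and grouping; not Mahout code. */
final class PairGroupingDemo {

  /** Emits one "a,b" key per pair of rows in a column, in the order given. */
  static List<String> emitPairs(int[] rowsInColumn) {
    List<String> keys = new ArrayList<>();
    for (int i = 0; i < rowsInColumn.length; i++) {
      for (int j = i + 1; j < rowsInColumn.length; j++) {
        keys.add(rowsInColumn[i] + "," + rowsInColumn[j]);
      }
    }
    return keys;
  }

  /** Counts the cooccurrences that land under each key, as grouping by key would. */
  static Map<String, Integer> group(int[][] columns) {
    Map<String, Integer> counts = new TreeMap<>();
    for (int[] column : columns) {
      for (String key : emitPairs(column)) {
        counts.merge(key, 1, Integer::sum);
      }
    }
    return counts;
  }

  public static void main(String[] args) {
    // Rows arrive sorted in every column: both cooccurrences share one key.
    int[][] sorted = {{1, 2}, {1, 2}};
    // Second column arrives in a different order: the cooccurrences are split.
    int[][] unsorted = {{1, 2}, {2, 1}};
    System.out.println(group(sorted));   // prints {1,2=2}
    System.out.println(group(unsorted)); // prints {1,2=1, 2,1=1}
  }
}
```

With sorted input the problem stays hidden, which would be consistent with the tests passing in a local run.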

[jira] Commented: (MAHOUT-610) Not all Coocurrences provided to SimilarityReducer

Posted by "Sebastian Schelter (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12995238#comment-12995238 ] 

Sebastian Schelter commented on MAHOUT-610:
-------------------------------------------

The bug is confirmed. I will commit the fix in the next few days.

[jira] Commented: (MAHOUT-610) Not all Coocurrences provided to SimilarityReducer

Posted by "Hudson (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12995531#comment-12995531 ] 

Hudson commented on MAHOUT-610:
-------------------------------

Integrated in Mahout-Quality #631 (See [https://hudson.apache.org/hudson/job/Mahout-Quality/631/])
    MAHOUT-610 Not all Coocurrences provided to SimilarityReducer

