Posted to dev@mahout.apache.org by "CodyInnowhere (JIRA)" <ji...@apache.org> on 2012/06/15 10:55:42 UTC

[jira] [Created] (MAHOUT-1032) AggregateAndRecommendReducer gets OOM in setup() method

CodyInnowhere created MAHOUT-1032:
-------------------------------------

             Summary: AggregateAndRecommendReducer gets OOM in setup() method
                 Key: MAHOUT-1032
                 URL: https://issues.apache.org/jira/browse/MAHOUT-1032
             Project: Mahout
          Issue Type: Bug
          Components: Collaborative Filtering
    Affects Versions: 0.5, 0.6, 0.7, 0.8
         Environment: hadoop cluster with -Xmx set to 2G
            Reporter: CodyInnowhere
            Assignee: Sean Owen


This bug is actually caused by the very first job, itemIDIndex. That job maps each itemID to an integer index, and the later AggregateAndRecommendReducer then reads all of the items into the OpenIntLongHashMap indexItemIDMap. For large data sets, however, this fails: my test data set covers 100 million+ items (not too many for a large e-commerce website), and the tasks run out of memory in the setup() method. I don't think the itemIDIndex job is necessary; without it, the final AggregateAndRecommend step would not have to read all items into memory to do the reverse index mapping.
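For context, here is a simplified, hypothetical sketch of the pattern being described, not the actual Mahout source: a reducer whose setup() pulls every (index, itemID) pair produced by the itemIDIndex job into an in-memory OpenIntLongHashMap. OpenIntLongHashMap and the setup() pattern are the real elements; the class name, config key, and writable types are placeholders.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.mahout.math.map.OpenIntLongHashMap;

    // Illustrative skeleton only; not the actual AggregateAndRecommendReducer.
    public class ReverseIndexLoadingReducer
        extends Reducer<IntWritable, Text, LongWritable, Text> {

      private final OpenIntLongHashMap indexItemIDMap = new OpenIntLongHashMap();

      @Override
      protected void setup(Context context) throws IOException {
        Configuration conf = context.getConfiguration();
        Path mappingFile = new Path(conf.get("itemIDIndexPath"));  // hypothetical config key
        FileSystem fs = FileSystem.get(conf);
        // Every (index, itemID) pair emitted by the itemIDIndex job is read into
        // this in-memory map so reduce() can translate indices back to item IDs.
        // At 100M+ entries (an int plus a long each, before hash-table overhead)
        // this is well over 1GB of heap, which is what exhausts a 2GB -Xmx task.
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, mappingFile, conf);
        try {
          IntWritable index = new IntWritable();
          LongWritable itemID = new LongWritable();
          while (reader.next(index, itemID)) {
            indexItemIDMap.put(index.get(), itemID.get());
          }
        } finally {
          reader.close();
        }
      }
    }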

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAHOUT-1032) AggregateAndRecommendReducer gets OOM in setup() method

Posted by "CodyInnowhere (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-1032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13295584#comment-13295584 ] 

CodyInnowhere commented on MAHOUT-1032:
---------------------------------------

Well, we have a billion+ distinct items (www.taobao.com); the test data set is a subset of the items online. I see the reason for the index mapping; however, it makes enterprise-scale data sets a bit difficult to fit into Mahout CF.
BTW, the index mapping is also a problem with this many items, as our itemIDs may exceed Integer.MAX_VALUE.
                

[jira] [Commented] (MAHOUT-1032) AggregateAndRecommendReducer gets OOM in setup() method

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-1032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13295593#comment-13295593 ] 

Sean Owen commented on MAHOUT-1032:
-----------------------------------

Yeah, I can imagine having a billion distinct IDs -- but they may not be the logical entities you recommend on. I'm suggesting that it's unlikely there are 'really' a billion items, and that you would perhaps benefit a lot by collapsing them. Whatever approach you take, it will almost surely save you money in processing.

Put another way, unless you have trillions of data points, this data set is going to be very sparse, so much so that the result may not be very useful. What's the average number of interactions per user or item?

You can rewrite these bits to do the lookups via an M/R join. It just takes more time, since the whole data set has to be mapped out again along with copies of the lookup, joined, and then finally output. Almost any situation where something is loaded into memory is 'cheating', but it's a very useful speedup, since it works fine up to 'merely huge' numbers of items, like 10M.
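For anyone wanting to try that, a rough, hypothetical sketch of the reduce side of such a join (the class name and tagging scheme are made up for illustration): both the (index, itemID) pairs from itemIDIndex and the recommendation rows, still keyed by int index, are shuffled to the same reducer with a one-character tag, and the reducer swaps the index for the real item ID without ever holding the whole mapping in memory.

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Hypothetical reduce-side join: each value carries a one-char tag,
    // "M" for the (index -> itemID) mapping record and "R" for a
    // recommendation payload that still refers to the item by its int index.
    public class ReverseIndexJoinReducer
        extends Reducer<IntWritable, Text, LongWritable, Text> {

      @Override
      protected void reduce(IntWritable index, Iterable<Text> tagged, Context context)
          throws IOException, InterruptedException {
        long itemID = -1L;
        List<String> pending = new ArrayList<>();
        for (Text value : tagged) {
          String v = value.toString();
          if (v.startsWith("M")) {
            itemID = Long.parseLong(v.substring(1));   // the real long item ID
          } else {
            pending.add(v.substring(1));               // recommendation payload
          }
        }
        if (itemID < 0) {
          return;  // index never appeared in the mapping; nothing to translate
        }
        // Emit each payload keyed by the original long item ID. Only the handful
        // of values for one index are ever buffered, never the whole mapping.
        LongWritable outKey = new LongWritable(itemID);
        for (String payload : pending) {
          context.write(outKey, new Text(payload));
        }
      }
    }

The extra cost Sean mentions is visible here: the full recommendation output has to be shuffled a second time just to translate the keys, which is why the in-memory map is such a useful shortcut when it fits.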
                

[jira] [Commented] (MAHOUT-1032) AggregateAndRecommendReducer gets OOM in setup() method

Posted by "CodyInnowhere (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-1032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13295596#comment-13295596 ] 

CodyInnowhere commented on MAHOUT-1032:
---------------------------------------

The data is very sparse indeed. I think I should do more pruning before trying this algorithm again.
                

[jira] [Commented] (MAHOUT-1032) AggregateAndRecommendReducer gets OOM in setup() method

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-1032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13295579#comment-13295579 ] 

Sean Owen commented on MAHOUT-1032:
-----------------------------------

(100M items is a very large number actually -- not even Amazon sells nearly that many SKUs. What site would? It may be that the same item exists under a thousand different IDs, in which case a solution is to combine data for the same item ahead of time. That will give you more than just a performance benefit.)

It is necessary, because the IDs at this stage are not the same IDs that were in the input. They are hashed and need to be un-hashed. The hash happens to map a long to the same int value, for values smaller than Integer.MAX_VALUE. If your IDs really are ints, then yes, you could chop it out with no effect. I'd do that in your local copy if you can make that assumption.
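To make the 'hashed and un-hashed' point concrete, the long-to-int index mapping is essentially of the following shape (a sketch under that assumption, not a verbatim copy of the Mahout source):

    // Sketch of the long -> int index mapping described above. For IDs in
    // [0, Integer.MAX_VALUE] the high 32 bits are zero, so the hash equals the
    // ID itself; only larger IDs actually need the reverse map to recover them.
    public final class IdToIndexSketch {

      static int idToIndex(long itemID) {
        // fold the two halves of the long together, then clear the sign bit
        int hash = (int) (itemID ^ (itemID >>> 32));
        return hash & 0x7FFFFFFF;
      }

      public static void main(String[] args) {
        System.out.println(idToIndex(12345L));          // 12345 (identity for small IDs)
        System.out.println(idToIndex(2147483647L));     // 2147483647 (Integer.MAX_VALUE)
        System.out.println(idToIndex(10000000000L));    // some int; the original ID is lost
      }
    }

Which is why, if every itemID already fits in an int, the index equals the ID and the reverse map is pure overhead.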

Alternatively -- give your workers some more memory? I would imagine it's not like you need 8GB to fit this in memory. Maybe that's feasible.
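Roughly (the overhead factor is a guess, not a measurement): 100M entries x (4-byte int index + 8-byte long ID) is about 1.2GB of raw payload, and with open-addressing load-factor and bookkeeping overhead perhaps 2-3GB, so a heap in the 4-6GB range could plausibly hold the map where the current 2GB -Xmx does not.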


This is indeed a somewhat ugly step, and it's needed because the Vectors are indexed by int.

FWIW, when I reimplemented these sorts of algorithms in Myrrix (myrrix.com), I started by creating primitives for vectors and matrices indexed by long, so no such translation is needed there. It does have a different, lesser bottleneck, though: reading the item-feature matrix into memory. The workers would probably need 40GB to hold that in memory -- possible, but really pushing it.
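As a back-of-the-envelope check on that figure (the feature count and element size are my assumptions, not numbers from this thread): 100,000,000 items x 50 features x 8 bytes per double is roughly 40GB, before any per-object or indexing overhead.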

Even so, it's possible to rewrite it (and many other algorithms) so that such things are never loaded into memory, using M/R joins instead of in-memory lookups. It's a tradeoff, and much, much slower.


That is to say... it's certainly possible, but probably infeasibly resource-intensive, to run algorithms like this on a data model of 100M items. It's conceivable, but at that point it's almost surely better to do some preprocessing to cut down the number of distinct items.
                

[jira] [Resolved] (MAHOUT-1032) AggregateAndRecommendReducer gets OOM in setup() method

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-1032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen resolved MAHOUT-1032.
-------------------------------

    Resolution: Not A Problem
    