Posted to dev@mahout.apache.org by "Peng Cheng (JIRA)" <ji...@apache.org> on 2013/08/12 18:44:49 UTC

[jira] [Comment Edited] (MAHOUT-1286) Memory-efficient DataModel, supporting fast online updates and element-wise iteration

    [ https://issues.apache.org/jira/browse/MAHOUT-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13737019#comment-13737019 ] 

Peng Cheng edited comment on MAHOUT-1286 at 8/12/13 4:44 PM:
-------------------------------------------------------------

Hi Dr Dunning,

Indeed, both Gokhan and I have experimented with that, but I've run into some difficulties, namely: 1) a columnar form doesn't support fast extraction of rows, yet a DataModel should allow quick getPreferencesFromUser() and getPreferencesForItem(); 2) a columnar form doesn't support fast online updates (time complexity is O(n), or at best O(log n) if block copy is used and the columns are sorted); 3) to create such a DataModel we need to initialize a HashMap first, which uses twice as much heap space during initialization and could defeat the purpose.
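To make point 2 concrete, here is a minimal sketch of an online update on a sorted column, assuming one column per item stored as parallel primitive arrays. The class SortedColumn and its methods are hypothetical names for illustration; they are not part of Mahout or of the attached patch. The binary search takes O(log n) comparisons, and the shift is a single block copy, which still moves up to n elements but does so as one bulk memory move.

```java
import java.util.Arrays;

/** Hypothetical sketch: one item's column, kept sorted by user ID in parallel arrays. */
final class SortedColumn {
    private long[] userIds = new long[4];
    private float[] values = new float[4];
    private int size = 0;

    /** Inserts the preference of userId, or overwrites it if already present. */
    void set(long userId, float value) {
        // Binary search over the used prefix: O(log n) comparisons.
        int pos = Arrays.binarySearch(userIds, 0, size, userId);
        if (pos >= 0) {
            values[pos] = value; // existing entry: in-place update
            return;
        }
        int insertAt = -(pos + 1); // insertion point encoded by binarySearch
        if (size == userIds.length) {
            // Grow both arrays together when full.
            userIds = Arrays.copyOf(userIds, size * 2);
            values = Arrays.copyOf(values, size * 2);
        }
        // Block copy shifts the tail one slot to the right (up to n elements moved).
        System.arraycopy(userIds, insertAt, userIds, insertAt + 1, size - insertAt);
        System.arraycopy(values, insertAt, values, insertAt + 1, size - insertAt);
        userIds[insertAt] = userId;
        values[insertAt] = value;
        size++;
    }

    /** Returns the stored preference, or NaN if the user has none for this item. */
    float get(long userId) {
        int pos = Arrays.binarySearch(userIds, 0, size, userId);
        return pos >= 0 ? values[pos] : Float.NaN;
    }

    int size() {
        return size;
    }

    /** Returns a copy of the user IDs in sorted order. */
    long[] userIds() {
        return Arrays.copyOf(userIds, size);
    }
}
```

Note that this only covers access along one axis; extracting a user's row from such columns would still require probing every column, which is point 1.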

I'm not sure whether Gokhan has encountered the same problems. I haven't heard from him for some time.

The search-based recommender is indeed a very tempting solution. I'm quite sure it improves on similarity-based recommenders across the board. But recommenders based on low-rank matrix factorization should merge preferences from new users into the prediction model immediately. Of course, you can just project a new user's preferences into the low-rank subspace, but this reduces performance a little.
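As a rough illustration of the projection mentioned above, the sketch below folds a new user into an existing factorization by solving a small ridge-regularized least-squares problem against fixed item factors. This is one common way to do such a projection and an assumption on my part, not necessarily what any Mahout recommender does; FoldIn, projectUser, predict and the lambda parameter are hypothetical names. Because the item factors stay fixed, they learn nothing from the new user, which is one plausible source of the small loss noted above.

```java
/** Hypothetical sketch of "fold-in": project a new user's ratings onto fixed item factors. */
final class FoldIn {

    /**
     * Solves (V_S^T V_S + lambda * I) u = V_S^T r, where V_S holds the factor rows
     * of the rated items. Assumes lambda > 0, or enough ratings for a nonsingular system.
     */
    static double[] projectUser(double[][] itemFactors, int[] ratedItems,
                                double[] ratings, double lambda) {
        int k = itemFactors[0].length;
        double[][] a = new double[k][k + 1]; // augmented matrix [A | b]
        for (int n = 0; n < ratedItems.length; n++) {
            double[] v = itemFactors[ratedItems[n]];
            for (int i = 0; i < k; i++) {
                for (int j = 0; j < k; j++) {
                    a[i][j] += v[i] * v[j];
                }
                a[i][k] += v[i] * ratings[n];
            }
        }
        for (int i = 0; i < k; i++) {
            a[i][i] += lambda;
        }
        // Gaussian elimination with partial pivoting on the k x k system.
        for (int col = 0; col < k; col++) {
            int pivot = col;
            for (int row = col + 1; row < k; row++) {
                if (Math.abs(a[row][col]) > Math.abs(a[pivot][col])) {
                    pivot = row;
                }
            }
            double[] tmp = a[col];
            a[col] = a[pivot];
            a[pivot] = tmp;
            for (int row = col + 1; row < k; row++) {
                double f = a[row][col] / a[col][col];
                for (int j = col; j <= k; j++) {
                    a[row][j] -= f * a[col][j];
                }
            }
        }
        // Back substitution.
        double[] u = new double[k];
        for (int i = k - 1; i >= 0; i--) {
            double s = a[i][k];
            for (int j = i + 1; j < k; j++) {
                s -= a[i][j] * u[j];
            }
            u[i] = s / a[i][i];
        }
        return u;
    }

    /** Predicted preference: dot product of user and item factors. */
    static double predict(double[] userFactors, double[] itemFactor) {
        double s = 0.0;
        for (int i = 0; i < userFactors.length; i++) {
            s += userFactors[i] * itemFactor[i];
        }
        return s;
    }
}
```

The cost per new user is O(m k^2 + k^3) for m ratings and rank k, so it is cheap enough to run on every incoming preference without retraining the whole model.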

I'm not sure how well Lucene supports online updates of indices, but according to the people I'm working with, online recommenders seem to be in demand these days.
                
> Memory-efficient DataModel, supporting fast online updates and element-wise iteration
> -------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-1286
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1286
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Collaborative Filtering
>    Affects Versions: 0.9
>            Reporter: Peng Cheng
>            Assignee: Sean Owen
>              Labels: collaborative-filtering, datamodel, patch, recommender
>             Fix For: 0.9
>
>         Attachments: InMemoryDataModel.java, InMemoryDataModelTest.java
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Most DataModel implementations in the current CF component use hash maps to enable fast 2D indexing and updates. This is not memory-efficient for big data sets; e.g., the Netflix Prize dataset takes 11G of heap space as a FileDataModel.
> An improved DataModel implementation should use more compact data structures (like arrays); this trades a little time complexity in 2D indexing for a vast improvement in memory efficiency. In addition, any online recommender or online-to-batch converted recommender will not be affected by this in its training process.
