
Posted to dev@mahout.apache.org by "Gokhan Capan (JIRA)" <ji...@apache.org> on 2012/09/18 14:20:07 UTC

[jira] [Created] (MAHOUT-1069) Multi-target, side-info aware, SGD-based recommender algorithms, examples, and tools to run

Gokhan Capan created MAHOUT-1069:
------------------------------------

             Summary: Multi-target, side-info aware, SGD-based recommender algorithms, examples, and tools to run
                 Key: MAHOUT-1069
                 URL: https://issues.apache.org/jira/browse/MAHOUT-1069
             Project: Mahout
          Issue Type: Improvement
          Components: CLI, Collaborative Filtering
    Affects Versions: 0.8
            Reporter: Gokhan Capan
            Assignee: Sean Owen


Following our conversations on the dev list, I have completed merging the recommender algorithms mentioned in http://goo.gl/fh4d9 into Mahout.

These are a set of learning algorithms for matrix factorization based recommendation, which are capable of:

* Recommending multiple targets:
*# Numerical Recommendation with OLS Regression
*# Binary Recommendation with Logistic Regression
*# Multinomial Recommendation with Softmax Regression
*# Ordinal Recommendation with Proportional Odds Model

* Leveraging side info in mahout vector format where available
*# User side information
*# Item side information
*# Dynamic side information (side info at feedback moment, such as proximity, day of week etc.)

* Online learning
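
To illustrate the setting described in the list above, here is a minimal sketch of one online SGD update for a matrix factorization model with a numerical (squared-error) target. It is not taken from the patch and does not use Mahout's API: the function names `predict` and `sgd_step`, the default learning and regularization rates, and the plain Python lists standing in for Mahout vectors are all assumptions made for illustration.

```python
def predict(p_u, q_i):
    """Predicted rating: dot product of the user and item factor vectors."""
    return sum(a * b for a, b in zip(p_u, q_i))


def sgd_step(p_u, q_i, rating, learning_rate=0.01, reg=0.05):
    """Apply one online SGD update for a single (user, item, rating) event.

    Minimizes squared error with L2 regularization. Both factor vectors are
    updated in place; the prediction error before the update is returned.
    The default rates are illustrative, not values from the patch.
    """
    err = rating - predict(p_u, q_i)
    for f in range(len(p_u)):
        pu, qi = p_u[f], q_i[f]  # keep the old values so both updates use them
        p_u[f] += learning_rate * (err * qi - reg * pu)
        q_i[f] += learning_rate * (err * pu - reg * qi)
    return err


if __name__ == "__main__":
    user, item = [1.0, 0.5], [0.5, 1.0]
    # Feeding the same feedback event repeatedly mimics a stream of updates.
    for _ in range(500):
        sgd_step(user, item, rating=4.0)
    print(predict(user, item))
```

Because each update touches only the factors of one user and one item, the same step can be applied as feedback arrives, which is what makes the online learning mode possible. A binary target would replace the squared-error gradient with the logistic one.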

Some command-line tools are provided as mahout jobs, for pre-experiment utilities and running experiments.

Evaluation tools for numerical and categorical recommenders are added.

A simple example for the Movielens-1M data is provided, and it achieved pretty good results (0.851 RMSE on a randomly generated test set, after tuning the learning and regularization rates on a separate validation set).
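
For readers unfamiliar with the evaluation protocol mentioned above, the following is a small sketch of a random train/validation/test split and an RMSE computation. It is an assumption-laden illustration, not the evaluation code from the patch: the names `three_way_split` and `rmse`, the 80/10/10 fractions, and the seed are hypothetical.

```python
import math
import random


def three_way_split(ratings, train_frac=0.8, valid_frac=0.1, seed=42):
    """Shuffle the ratings and split them into train, validation and test lists.

    The fractions and the seed are illustrative choices. Whatever is left
    after the train and validation parts goes to the test list.
    """
    shuffled = list(ratings)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_valid = int(len(shuffled) * valid_frac)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]
    return train, valid, test


def rmse(predicted, actual):
    """Root mean squared error between two equally long sequences."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))
```

The learning and regularization rates would be chosen by comparing RMSE on the validation list, and the final figure reported once on the held-out test list.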

The existing Mahout code is not modified, except for lines added to driver.class.props for the command-line tools. Even so, the result is a huge patch with dozens of new source files.

These algorithms are heavily inspired by various influential recommender system papers, especially Yehuda Koren's. For example, the ordinal model is from Koren's OrdRec paper, except that the cuts are global rather than user-specific.
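
As a rough illustration of an ordinal model with global cuts, the sketch below computes rating-level probabilities under a proportional odds (cumulative logit) model, where P(r <= k) = sigmoid(t_k - score) and the cuts t_1 < ... < t_{K-1} are shared by all users. This is a generic textbook formulation, not code from the patch; the function names, the sign convention, and the example cut values are assumptions.

```python
import math


def sigmoid(x):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def ordinal_probabilities(score, cuts):
    """Return P(r = k) for each of the len(cuts) + 1 ordered rating levels.

    `score` is the model output for a (user, item) pair, for example the dot
    product of their factor vectors. `cuts` must be sorted in increasing
    order and is global, that is, the same list is used for every user.
    """
    cumulative = [sigmoid(t - score) for t in cuts] + [1.0]
    probabilities = []
    previous = 0.0
    for c in cumulative:
        probabilities.append(c - previous)  # P(r <= k) - P(r <= k - 1)
        previous = c
    return probabilities


if __name__ == "__main__":
    # Three global cuts define four ordered rating levels.
    print(ordinal_probabilities(0.0, [-1.0, 0.0, 1.0]))
```

A larger score shifts probability mass toward the higher rating levels, while the cuts decide where the boundaries between levels fall.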

Left for future:
# The core algorithms are tested, but there are probably some parts that the tests do not cover. I have seen many of them work in practice without problems, and I am going to add new tests regularly.
# Not all algorithms have been tried on appropriate datasets, and they may need some improvement. However, I am also using the algorithms for my M.Sc. thesis, so I will eventually submit more experiments. Since the experiment infrastructure exists, I believe the community may provide more experiments, too.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (MAHOUT-1069) Multi-target, side-info aware, SGD-based recommender algorithms, examples, and tools to run

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-1069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated MAHOUT-1069:
------------------------------

    Assignee:     (was: Sean Owen)
    
> Multi-target, side-info aware, SGD-based recommender algorithms, examples, and tools to run
> -------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-1069
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1069
>             Project: Mahout
>          Issue Type: Improvement
>          Components: CLI, Collaborative Filtering
>    Affects Versions: 0.8
>            Reporter: Gokhan Capan
>              Labels: cf, improvement, sgd
>         Attachments: MAHOUT-1069.patch, MAHOUT-1069.patch
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h

[jira] [Commented] (MAHOUT-1069) Multi-target, side-info aware, SGD-based recommender algorithms, examples, and tools to run

Posted by "Sebastian Schelter (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-1069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13504453#comment-13504453 ] 

Sebastian Schelter commented on MAHOUT-1069:
--------------------------------------------

Hi Gokhan,

I finally found the time for a quick look into your code. It looks really interesting and it seems like you put a lot of effort into it. From a skim through your patch, I get the impression that the functionality could definitely be useful for a lot of people, especially the incorporation of side data.

Unfortunately, I agree with Sean that your code cannot be integrated as is. It would introduce a kind of internal "side-branch" in Mahout's recommender code. This is not your fault; a big part of the recommender code would need refactoring to be able to easily integrate things like side information.

So I think the best way to get your code to the public would be either to publish it as a separate open source project that depends on Mahout (as Sean suggested) or to integrate it step by step as small patches, which would be a tedious process.

I recently created a simple "weblayer" for Mahout's recommenders ( https://github.com/plista/kornakapi ) and also ran into some of the issues described here, when I tried to add folding-in of new users into factorization-based recommenders.



                

[jira] [Updated] (MAHOUT-1069) Multi-target, side-info aware, SGD-based recommender algorithms, examples, and tools to run

Posted by "Gokhan Capan (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-1069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gokhan Capan updated MAHOUT-1069:
---------------------------------

    Attachment: MAHOUT-1069.patch

Attached is the patch.
                

[jira] [Commented] (MAHOUT-1069) Multi-target, side-info aware, SGD-based recommender algorithms, examples, and tools to run

Posted by "Otis Gospodnetic (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-1069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13468837#comment-13468837 ] 

Otis Gospodnetic commented on MAHOUT-1069:
------------------------------------------

I didn't look at the patch, but wouldn't that require Gokhan to grab a chunk of Mahout, too?

                

[jira] [Commented] (MAHOUT-1069) Multi-target, side-info aware, SGD-based recommender algorithms, examples, and tools to run

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-1069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13457769#comment-13457769 ] 

Sean Owen commented on MAHOUT-1069:
-----------------------------------

I imagine this is all great work. As I commented off-list, it is a big enough and even different enough beast that it feels like it should be a separate project. The Mahout code base is already uneven and sprawling and I think this would exacerbate that -- and not generate much "synergy" worth the effort of integration.
                

[jira] [Updated] (MAHOUT-1069) Multi-target, side-info aware, SGD-based recommender algorithms, examples, and tools to run

Posted by "Gokhan Capan (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-1069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gokhan Capan updated MAHOUT-1069:
---------------------------------

    Attachment: MAHOUT-1069.patch

Fixed a few minor bugs and updated the patch
                