Posted to dev@mahout.apache.org by "Julien Le Dem (JIRA)" <ji...@apache.org> on 2009/08/07 18:45:14 UTC

[jira] Commented: (MAHOUT-106) PLSI/EM in pig based on hofmann's ACM 04 paper.

    [ https://issues.apache.org/jira/browse/MAHOUT-106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12740629#action_12740629 ] 

Julien Le Dem commented on MAHOUT-106:
--------------------------------------

Hi,
First of all, thanks a lot to Prasen for this PLSI implementation :)
2 comments:

1) As is, it only works in Pig local mode and has a dependency on Python.
I suggest removing the dependency on Python and updating the scripts so that it also runs in mapred mode.
If you agree, I can propose an updated patch.

2) I've been looking at the complexity of the algorithm.
The computation of Q* produces as many records as (number of users) * (number of stories) * (number of values of z), which quickly becomes a very large number.
The article states that it was run on a dataset that yields 61265 * 1623 * 30 ~ 3E9 records for Q*. I'm looking at the record count rather than the number of operations because the records are what cause I/O and become a bottleneck in the processing.
Have you tried running it on larger datasets?
Which optimizations do you think could be applied to run on larger datasets?
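
To make the record count concrete, here is a rough Python sketch of the arithmetic. It is not part of the patch: the function name q_star_record_count is made up for illustration, and it assumes, as described above, one Q* record per (user, story, z) combination.

```python
def q_star_record_count(num_users: int, num_stories: int, num_z: int) -> int:
    """Record count for Q*, assuming one record per (user, story, z) triple."""
    return num_users * num_stories * num_z


# Figures quoted from the article: 61265 users, 1623 stories, 30 values of z.
paper_count = q_star_record_count(61265, 1623, 30)
print(f"{paper_count:,} records (~{paper_count:.1e})")
```

Because the count is a plain product, doubling any one of the three dimensions doubles the number of records written per iteration, so the intermediate data grows quickly with the dataset.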

> PLSI/EM in pig based on hofmann's ACM 04 paper. 
> ------------------------------------------------
>
>                 Key: MAHOUT-106
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-106
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Collaborative Filtering
>         Environment: Pig/Hadoop 
>            Reporter: Prasen Mukherjee
>            Assignee: Grant Ingersoll
>            Priority: Minor
>             Fix For: 0.2
>
>         Attachments: plsi_pig.patch
>
>   Original Estimate: 96h
>  Remaining Estimate: 96h
>
> Based on the following paper by Hofmann: T. Hofmann, "Latent Semantic Models for Collaborative Filtering", ACM Transactions on Information Systems, 2004, vol. 22(1), pp. 89-115.
