Posted to dev@mahout.apache.org by "Julien Le Dem (JIRA)" <ji...@apache.org> on 2009/08/07 18:45:14 UTC
[jira] Commented: (MAHOUT-106) PLSI/EM in pig based on hofmann's
ACM 04 paper.
[ https://issues.apache.org/jira/browse/MAHOUT-106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12740629#action_12740629 ]
Julien Le Dem commented on MAHOUT-106:
--------------------------------------
Hi,
First of all, thanks a lot to Prasen for this PLSI implementation :)
2 comments:
1) As is, it only works in Pig local mode and has a dependency on Python.
I suggest removing the dependency on Python and updating the scripts so that it also runs in mapred mode.
If you agree, I can propose an updated patch.
2) I've been looking at the complexity of the algorithm.
Computing Q* produces one record per (user, story, z) triple, i.e. number of users * number of stories * number of values of z, which quickly becomes a very large number.
The article states that it was run on a dataset giving 61265*1623*30 ~ 3E9 records for Q*. I'm looking at the record count rather than the number of operations because the records are what cause IO and become the bottleneck in processing.
Have you tried running it on larger datasets?
Which optimizations do you think could be applied to run it on larger datasets?
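
For reference, the record count from comment 2 can be reproduced with a few lines of Python. This is only a back-of-the-envelope sketch: the function names are made up for illustration, and the 16 bytes per record used for the size estimate is an assumed figure, not something measured on the patch.

```python
def qstar_record_count(num_users: int, num_stories: int, num_z: int) -> int:
    """Records produced if Q* is materialized for every (user, story, z) triple."""
    return num_users * num_stories * num_z


def approx_size_gib(records: int, bytes_per_record: int) -> float:
    """Rough data volume in GiB; bytes_per_record is an assumption, not a measurement."""
    return records * bytes_per_record / 2**30


if __name__ == "__main__":
    # Dataset dimensions quoted from the article: users * stories * values of z.
    records = qstar_record_count(61265, 1623, 30)
    print(records)  # 2982992850, i.e. roughly 3E9

    # Assumed 16 bytes per record, purely to get an order of magnitude.
    print(approx_size_gib(records, 16))
```

Even with a small assumed record size, the intermediate data for a single Q* computation lands in the tens of GiB, which is why the record count matters more here than the operation count.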
> PLSI/EM in pig based on hofmann's ACM 04 paper.
> ------------------------------------------------
>
> Key: MAHOUT-106
> URL: https://issues.apache.org/jira/browse/MAHOUT-106
> Project: Mahout
> Issue Type: New Feature
> Components: Collaborative Filtering
> Environment: Pig/Hadoop
> Reporter: Prasen Mukherjee
> Assignee: Grant Ingersoll
> Priority: Minor
> Fix For: 0.2
>
> Attachments: plsi_pig.patch
>
> Original Estimate: 96h
> Remaining Estimate: 96h
>
> Based on the following paper by Hofmann: T. Hofmann, Latent Semantic Models for Collaborative Filtering, ACM Transactions on Information Systems, 2004, vol. 22(1), pp. 89-115.