Posted to dev@mahout.apache.org by "Prasen Mukherjee (JIRA)" <ji...@apache.org> on 2009/02/11 13:28:59 UTC

[jira] Created: (MAHOUT-106) PLSI/EM in pig based on hofmann's ACM 04 paper.

PLSI/EM in pig based on hofmann's ACM 04 paper. 
------------------------------------------------

                 Key: MAHOUT-106
                 URL: https://issues.apache.org/jira/browse/MAHOUT-106
             Project: Mahout
          Issue Type: New Feature
          Components: Collaborative Filtering
         Environment: Pig/Hadoop 
            Reporter: Prasen Mukherjee
            Priority: Minor


Based on the following paper by Hofmann: T. Hofmann, "Latent Semantic Models for Collaborative Filtering," ACM Transactions on Information Systems, 2004, vol. 22(1), pp. 89-115.



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (MAHOUT-106) PLSI/EM in pig based on hofmann's ACM 04 paper.

Posted by "Sebastian Schelter (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Schelter updated MAHOUT-106:
--------------------------------------

    Attachment: plsi-java.patch



[jira] Commented: (MAHOUT-106) PLSI/EM in pig based on hofmann's ACM 04 paper.

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12796011#action_12796011 ] 

Ted Dunning commented on MAHOUT-106:
------------------------------------


See here: http://hadoop.apache.org/pig/releases.html

Pig 0.5 works with Hadoop 0.20.

Pig 0.4 works with Hadoop 0.18.

No version works with Hadoop 0.19 or 0.21



[jira] Commented: (MAHOUT-106) PLSI/EM in pig based on hofmann's ACM 04 paper.

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795996#action_12795996 ] 

Grant Ingersoll commented on MAHOUT-106:
----------------------------------------

I still intend to review this for 0.3.



[jira] Commented: (MAHOUT-106) PLSI/EM in pig based on hofmann's ACM 04 paper.

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12796007#action_12796007 ] 

Ted Dunning commented on MAHOUT-106:
------------------------------------


Pig programs are a pain in the * because Pig has relatively strict compatibility requirements.





[jira] Commented: (MAHOUT-106) PLSI/EM in pig based on hofmann's ACM 04 paper.

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795988#action_12795988 ] 

Sean Owen commented on MAHOUT-106:
----------------------------------

Checking the status of this. My hunch is that, all in all, whatever is in Mahout ought to be integrated with Java, so it sounds like this requires a Pig .jar file plus some additional configuration and scripts. Are these available?

I like the idea of having this implementation; it would be better if it were 'natively' integrated with the framework in Java, but it's not bad to have this.

Otherwise I can shelve this.



[jira] Updated: (MAHOUT-106) PLSI/EM in pig based on hofmann's ACM 04 paper.

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated MAHOUT-106:
-----------------------------

    Fix Version/s:     (was: 0.2)
                   0.3



[jira] Updated: (MAHOUT-106) PLSI/EM in pig based on hofmann's ACM 04 paper.

Posted by "Prasen Mukherjee (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Prasen Mukherjee updated MAHOUT-106:
------------------------------------


One possible suggestion:

Under the plsi directory:

udf-java/
udf-java/ant  (or maven; this will build the udf.jar to be used by the pig scripts)
udf-java/src/
udf-java/lib/ (optional, if we need any third-party jars for the UDF; not required if we use maven dependencies)





[jira] Assigned: (MAHOUT-106) PLSI/EM in pig based on hofmann's ACM 04 paper.

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll reassigned MAHOUT-106:
--------------------------------------

    Assignee: Grant Ingersoll



[jira] Updated: (MAHOUT-106) PLSI/EM in pig based on hofmann's ACM 04 paper.

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated MAHOUT-106:
-----------------------------

     Original Estimate:     (was: 96h)
    Remaining Estimate:     (was: 96h)
         Fix Version/s:     (was: 0.4)
                Labels: pig plsi  (was: )

OK, holding it open then for longer.



[jira] Updated: (MAHOUT-106) PLSI/EM in pig based on hofmann's ACM 04 paper.

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated MAHOUT-106:
-----------------------------

        Status: Resolved  (was: Patch Available)
    Resolution: Won't Fix

I hope it's not presumptuous to go ahead and call this shelved. There has been no action on this in 1.5 years, and no other Pig code has 'stuck' in the project. It can always be reopened.



[jira] Commented: (MAHOUT-106) PLSI/EM in pig based on hofmann's ACM 04 paper.

Posted by "Benson Margulies (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12796008#action_12796008 ] 

Benson Margulies commented on MAHOUT-106:
-----------------------------------------

* = snout?



[jira] Updated: (MAHOUT-106) PLSI/EM in pig based on hofmann's ACM 04 paper.

Posted by "Prasen Mukherjee (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Prasen Mukherjee updated MAHOUT-106:
------------------------------------

    Attachment: plsi_pig.patch

Attaching patch-file



[jira] Updated: (MAHOUT-106) PLSI/EM in pig based on hofmann's ACM 04 paper.

Posted by "Prasen Mukherjee (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Prasen Mukherjee updated MAHOUT-106:
------------------------------------

    Status: Patch Available  (was: Open)

Patch submitted along with run instructions.  



[jira] Commented: (MAHOUT-106) PLSI/EM in pig based on hofmann's ACM 04 paper.

Posted by "Julien Le Dem (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12740629#action_12740629 ] 

Julien Le Dem commented on MAHOUT-106:
--------------------------------------

Hi,
First of all, thanks a lot to Prasen for this PLSI implementation :)
2 comments:

1) As is, it only works in Pig local mode and has a dependency on Python.
I suggest removing the dependency on Python and updating the scripts so that it also runs in mapred mode.
If you agree, I can propose an updated patch.

2) I've been looking at the complexity of the algorithm.
The computation of Q* produces as many records as (number of users) * (number of stories) * (number of values of z), which quickly gets to a pretty big number.
The article states it's been run on a dataset of 61265*1623*30 ~ 3E9 records for Q*. I'm looking at the record count as opposed to the operation count because the records are what will cause I/O and become a bottleneck in the processing.
Have you tried running it on larger datasets?
What optimizations do you think can be applied to run on larger datasets?
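
To put the record count above in concrete terms, here is a rough back-of-the-envelope sketch in Python. The function name and the 16-bytes-per-record figure are assumptions made purely for illustration; they are not taken from the patch or from the paper.

```python
def q_star_records(num_users: int, num_stories: int, num_z: int) -> int:
    """Records produced if Q* is materialised for every (user, story, z) triple."""
    return num_users * num_stories * num_z


# Figures quoted in the comment above.
records = q_star_records(61265, 1623, 30)
print(records)  # 2982992850, i.e. roughly 3E9

# Hypothetical record size (an assumption, not a measurement).
ASSUMED_BYTES_PER_RECORD = 16
approx_gib = records * ASSUMED_BYTES_PER_RECORD / 2**30
print(f"about {approx_gib:.1f} GiB written per iteration under that assumption")
```

Even with a small assumed record size, the full cross product would have to be written and re-read on every EM iteration, which is why the record count matters more here than the operation count.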



[jira] Commented: (MAHOUT-106) PLSI/EM in pig based on hofmann's ACM 04 paper.

Posted by "Prasen Mukherjee (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12740836#action_12740836 ] 

Prasen Mukherjee commented on MAHOUT-106:
-----------------------------------------

Totally agree with Julien's comment (1). I was too lazy to write the Java UDF code.

On (2): The main E/M code is in plsi_singleiteration.pig. Although (as you have rightly pointed out) the computation of q* produces that many (s*z*u) results, I feel that in the E/M pig code we are not loading that much data into memory at any point in time. I think at any point in time we are accessing at most (s*z or s*u) entries. That too can be eliminated by introducing an algebraic UDF, which is probably happening at the 1st and 2nd M-steps in the following lines:
--compute sum(z,u) = sum_over_z(nq_zu) -- means group by u
--compute sum(s,z) = sum_over_s(nq_sz) -- means group by z


Having said that, I will admit that I personally have not run it on very large datasets.
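
The point about an algebraic aggregation can be illustrated with a small Python sketch. It is not the Pig script or a Pig UDF; the record layout and function names are hypothetical. It only shows that a per-group sum can be computed from partial sums over chunks, so that one running number per group key is held rather than the whole group.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical record layout: (group key, other key, value), e.g. (u, z, nq_zu).
Record = Tuple[str, str, float]


def partial_sums(records: Iterable[Record]) -> Dict[str, float]:
    """Combiner-style step: sum the values per group key over one chunk."""
    sums: Dict[str, float] = defaultdict(float)
    for key, _other, value in records:
        sums[key] += value
    return dict(sums)


def merge_sums(parts: Iterable[Dict[str, float]]) -> Dict[str, float]:
    """Reducer-style step: merge partial sums, one running total per key."""
    total: Dict[str, float] = defaultdict(float)
    for part in parts:
        for key, value in part.items():
            total[key] += value
    return dict(total)


# Two chunks of (u, z, nq_zu) records, summed over z and grouped by u.
chunk_a = [("u1", "z1", 0.25), ("u1", "z2", 0.5), ("u2", "z1", 1.0)]
chunk_b = [("u1", "z1", 0.25), ("u2", "z2", 2.0)]
per_user = merge_sums([partial_sums(chunk_a), partial_sums(chunk_b)])
print(per_user)  # {'u1': 1.0, 'u2': 3.0}
```

This addresses memory use during the group-by only; it does not reduce the number of q* records that are produced and written in the first place.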



[jira] Updated: (MAHOUT-106) PLSI/EM in pig based on hofmann's ACM 04 paper.

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll updated MAHOUT-106:
-----------------------------------

    Fix Version/s: 0.2



[jira] Commented: (MAHOUT-106) PLSI/EM in pig based on hofmann's ACM 04 paper.

Posted by "Julien Le Dem (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12796010#action_12796010 ] 

Julien Le Dem commented on MAHOUT-106:
--------------------------------------

Hi Ted,
Could you elaborate on your comment? What compatibility requirements are you referring to?



[jira] Commented: (MAHOUT-106) PLSI/EM in pig based on hofmann's ACM 04 paper.

Posted by "Sebastian Schelter (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12872881#action_12872881 ] 

Sebastian Schelter commented on MAHOUT-106:
-------------------------------------------

I converted the Pig code attached here to plain Java M/R code, hoping to create a PLSI implementation for Mahout. I got the code working, but now I feel kind of stuck, and I hope that someone can give me advice or join in on this.

The main flaw of this approach is (as Julien already stated above) that the computation of Q* produces as many records as number of users * number of stories * number of values of z, all of which need to be written to disk, which makes this code unusable.

I took a look at Hofmann's paper, and it says that the offline complexity of this algorithm is O(kN), with N being the number of observed ratings, so I don't understand why we would have to look at *all* possible user-item pairs as is done in the Pig code.

One possible approach to solving this problem could be to compute Q* only for the observed ratings. I've already tried writing p(s|z)p(z|u) to disk only for the observed user-item pairs in the PszPzuReducer (by simply loading all ratings into memory, which would introduce a new constraint on this algorithm...). It seems to help, and it works with the sample data provided with the Pig code, yet I'm not sure whether it's mathematically correct to do this (so that part is commented out in the code).
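
For reference, a minimal in-memory sketch of one EM iteration that touches only observed (user, item) pairs is given below. It is the textbook EM update for the co-occurrence aspect model with unweighted observations, written in plain Python; it is not the code from the patch or the Pig scripts, it ignores rating values, and all function and variable names are hypothetical. In that textbook formulation the E-step posterior is only needed for observed pairs, which is where the O(kN) cost per iteration comes from.

```python
import random
from collections import defaultdict


def normalise(weights):
    """Scale a list of non-negative weights so that it sums to 1."""
    total = sum(weights)
    return [w / total for w in weights]


def em_iteration(observed, p_s_given_z, p_z_given_u, k):
    """One EM step over observed (user, item) pairs only: O(k * N) work."""
    sz_acc = [defaultdict(float) for _ in range(k)]  # numerators for p(s|z)
    zu_acc = defaultdict(lambda: [0.0] * k)          # numerators for p(z|u)
    for u, s in observed:
        # E-step: q(z | u, s) is proportional to p(s|z) * p(z|u).
        q = normalise([p_s_given_z[z][s] * p_z_given_u[u][z] for z in range(k)])
        # Accumulate the expected counts needed by the M-step.
        for z in range(k):
            sz_acc[z][s] += q[z]
            zu_acc[u][z] += q[z]
    # M-step: renormalise the accumulated expected counts.
    new_p_sz = []
    for z in range(k):
        total = sum(sz_acc[z].values())
        new_p_sz.append({s: v / total for s, v in sz_acc[z].items()})
    new_p_zu = {u: normalise(acc) for u, acc in zu_acc.items()}
    return new_p_sz, new_p_zu


# Tiny made-up data set; random strictly positive initialisation.
observed = [("u1", "s1"), ("u1", "s2"), ("u2", "s2"), ("u3", "s3")]
k = 2
rng = random.Random(0)
items = sorted({s for _, s in observed})
users = sorted({u for u, _ in observed})
p_sz = [dict(zip(items, normalise([rng.random() + 0.1 for _ in items])))
        for _ in range(k)]
p_zu = {u: normalise([rng.random() + 0.1 for _ in range(k)]) for u in users}

for _ in range(5):
    p_sz, p_zu = em_iteration(observed, p_sz, p_zu, k)
```

Only the two parameter tables and the list of observed pairs are held here; the full users * stories * z table is never built. Whether this matches the rating-based model variants in the paper would still need to be checked against the paper itself.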

I must also admit that I don't see exactly how closely this approach corresponds to the PLSI approach presented in "Google News Personalization: Scalable Online Collaborative Filtering" (http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.80.4329&rep=rep1&type=pdf); maybe that could be another source of ideas.

The patch is only a work in progress: it still uses the old Hadoop API, lacks proper documentation, and has only one unit test. It's more a proof of concept. If it turns out that this approach can work for larger data sets, I will invest more time to refactor and beautify the code, but currently I'm not sure whether it's really going to work.
