Posted to dev@mahout.apache.org by "Pat Ferrel (JIRA)" <ji...@apache.org> on 2014/06/11 01:11:03 UTC

[jira] [Comment Edited] (MAHOUT-1464) Cooccurrence Analysis on Spark

    [ https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14027159#comment-14027159 ] 

Pat Ferrel edited comment on MAHOUT-1464 at 6/10/14 11:09 PM:
--------------------------------------------------------------

I think the same thing is happening with the number of item interactions:

    // Broadcast vector containing the number of interactions with each thing
    val bcastNumInteractions = drmBroadcast(drmI.colSums) // sums?

This broadcasts a vector of column sums. What we need is something like getNumNonZeroElements() for column vectors, or rather a way to get a Vector of non-zero counts per column. We could get them from the rows of the transposed matrix before doing the multiply of A.t %*% A or B.t %*% A, in which case we’d get the non-zero counts from the rows. Either way, I don’t see a way to get a vector of these values without doing a mapBlock on the transposed matrix. Am I missing something?
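
To make the distinction concrete, here is a minimal sketch in plain Scala collections. It deliberately uses no Mahout or Spark classes; the object and function names (NonZeroCounts, colSums, colNonZeroCounts, rowNonZeroCounts) are made up for illustration and are not part of the DRM DSL. It only shows that column sums and per-column non-zero counts differ whenever interaction values are not all 0 or 1, and that the column counts equal the row counts of the transposed matrix.

```scala
// Plain-Scala illustration only; none of these names exist in Mahout.
object NonZeroCounts {
  // Sum of the values in each column (what a column-sum operation yields).
  def colSums(m: Array[Array[Double]]): Array[Double] =
    m.transpose.map(_.sum)

  // Number of non-zero entries in each column (what is actually wanted).
  def colNonZeroCounts(m: Array[Array[Double]]): Array[Int] =
    m.transpose.map(_.count(_ != 0.0))

  // Number of non-zero entries in each row; applied to the transposed
  // matrix, this gives the same result as colNonZeroCounts.
  def rowNonZeroCounts(m: Array[Array[Double]]): Array[Int] =
    m.map(_.count(_ != 0.0))
}

// Rows are users, columns are items, values are interaction strengths.
val a = Array(
  Array(2.0, 0.0, 1.0),
  Array(3.0, 0.0, 0.0),
  Array(0.0, 5.0, 1.0)
)

val sums       = NonZeroCounts.colSums(a).toList
val counts     = NonZeroCounts.colNonZeroCounts(a).toList
val viaRowsOfT = NonZeroCounts.rowNonZeroCounts(a.transpose).toList
```

With this input the sums and the counts disagree for every column, which is the problem described above when the broadcast vector is meant to hold interaction counts.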

Currently the IndexedDataset is a very thin wrapper, but I could add two vectors that contain the number of non-zero elements for rows and columns. In that case I would perhaps have it extend CheckpointedDrm. Since CheckpointedDrm extends DrmLike, it could be used in the DSL algebra directly, which would make it simple to do the right thing with these vectors, as well as with the two id dictionaries, for transpose and multiply. But it’s a slippery slope.
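
As a rough sketch of that idea, assuming a purely hypothetical CountedMatrix type built on plain Scala collections (it is not IndexedDataset, CheckpointedDrm, or any other Mahout class, and the id dictionaries are left out), carrying the two count vectors through a transpose amounts to swapping them:

```scala
// Hypothetical wrapper for illustration; not Mahout's IndexedDataset.
final case class CountedMatrix(
    values: Vector[Vector[Double]],
    rowNonZeros: Vector[Int],
    colNonZeros: Vector[Int]) {

  // Transposing the values swaps the roles of the two count vectors,
  // so nothing has to be recounted.
  def t: CountedMatrix =
    CountedMatrix(values.transpose, colNonZeros, rowNonZeros)
}

object CountedMatrix {
  // Build the wrapper by counting non-zero entries once, up front.
  def fromValues(values: Vector[Vector[Double]]): CountedMatrix = {
    val rows = values.map(_.count(_ != 0.0))
    val cols = values.transpose.map(_.count(_ != 0.0))
    CountedMatrix(values, rows, cols)
  }
}

val m = CountedMatrix.fromValues(Vector(
  Vector(1.0, 0.0),
  Vector(1.0, 1.0),
  Vector(0.0, 0.0)
))
val mt = m.t
```

Multiplication is where the slope gets slippery: the counts of a product cannot be derived from the counts of its operands alone, so a wrapper like this would have to decide when to recount.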

Before I go off in the wrong direction, is there an existing way to get a vector of non-zero counts for rows or columns?




> Cooccurrence Analysis on Spark
> ------------------------------
>
>                 Key: MAHOUT-1464
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1464
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Collaborative Filtering
>         Environment: hadoop, spark
>            Reporter: Pat Ferrel
>            Assignee: Pat Ferrel
>             Fix For: 1.0
>
>         Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, run-spark-xrsj.sh
>
>
> Create a version of Cooccurrence Analysis (RowSimilarityJob with LLR) that runs on Spark. This should be compatible with Mahout Spark DRM DSL so a DRM can be used as input. 
> Ideally this would extend to cover MAHOUT-1422. This cross-cooccurrence has several applications including cross-action recommendations. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)