
Posted to dev@spark.apache.org by Debasish Das <de...@gmail.com> on 2014/12/10 20:44:58 UTC

Row Similarity

Hi,

It seems there are multiple places where we would like to compute row
similarities (exact or approximate).

RowMatrix.columnSimilarities already lets us compute the column
similarities of a tall, skinny matrix.

Similarly, we should have an API in RowMatrix called rowSimilarities that
computes similar rows in a map-reduce fashion. It would be useful for the
following use cases:

1. Generate topK users for each user from a matrix factorization model
2. Generate topK products for each product from a matrix factorization model
3. Generate a kernel matrix for use in spectral clustering
4. Generate a kernel matrix for use in kernel regression/classification
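
To make the request concrete, here is a minimal single-machine sketch of what a rowSimilarities-style computation would return, assuming cosine similarity as the measure (the same measure columnSimilarities uses). The function names and the tiny `user_factors` matrix are hypothetical, and this is brute force over all pairs, not a map-reduce implementation or anything that exists in MLlib.

```python
import heapq
import math


def row_similarities(rows):
    """Return {(i, j): cosine} for every pair of rows i < j (brute force)."""
    norms = [math.sqrt(sum(x * x for x in row)) for row in rows]
    sims = {}
    for i in range(len(rows)):
        for j in range(i + 1, len(rows)):
            if norms[i] == 0.0 or norms[j] == 0.0:
                continue  # cosine is undefined for an all-zero row
            dot = sum(a * b for a, b in zip(rows[i], rows[j]))
            sims[(i, j)] = dot / (norms[i] * norms[j])
    return sims


def top_k_similar_rows(rows, k):
    """For each row index, return the k most similar other rows as (score, index)."""
    neighbours = {i: [] for i in range(len(rows))}
    for (i, j), s in row_similarities(rows).items():
        neighbours[i].append((s, j))
        neighbours[j].append((s, i))
    return {i: heapq.nlargest(k, cands) for i, cands in neighbours.items()}


# Hypothetical user-factor matrix: one row per user, rank 2.
user_factors = [
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
]
top1 = top_k_similar_rows(user_factors, k=1)
```

The double loop visits every pair of rows, which is exactly the part that becomes a problem when the matrix is tall.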

I am not sure whether there is already a good map-reduce implementation of
row similarity that we can use. Ideas like Fastfood and random kitchen
sinks seem aimed more at the classification use case, but user similarities
also show up in recommendation, which is unsupervised.
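
For readers unfamiliar with the reference, random kitchen sinks (random Fourier features) approximate an RBF kernel with an explicit low-dimensional feature map, so a kernel matrix can be approximated by inner products of the mapped rows. The sketch below is a plain-Python illustration under that assumption; the function names, `gamma`, and the feature count are illustrative choices, not part of any Spark API.

```python
import math
import random


def random_fourier_features(dim, num_features, gamma, seed=0):
    """Sample a feature map z with z(x).z(y) ~= exp(-gamma * ||x - y||^2)."""
    rng = random.Random(seed)
    scale = math.sqrt(2.0 * gamma)  # frequencies w ~ N(0, 2 * gamma * I)
    ws = [[rng.gauss(0.0, scale) for _ in range(dim)]
          for _ in range(num_features)]
    bs = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(num_features)]
    norm = math.sqrt(2.0 / num_features)

    def z(x):
        # One cosine feature per sampled (w, b) pair.
        return [norm * math.cos(sum(wi * xi for wi, xi in zip(w, x)) + b)
                for w, b in zip(ws, bs)]

    return z


def rbf_kernel(x, y, gamma):
    """Exact RBF kernel value, for comparison with the approximation."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


z = random_fourier_features(dim=2, num_features=2000, gamma=0.5)
x, y = [1.0, 0.0], [0.5, 0.5]
approx = sum(a * b for a, b in zip(z(x), z(y)))
exact = rbf_kernel(x, y, gamma=0.5)
```

The approximation error shrinks roughly with the square root of the number of features, so `approx` should land close to `exact` here without matching it exactly.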

Is there a JIRA tracking this? If not, I can open one and we can discuss
it further there.

Thanks.
Deb

Re: Row Similarity

Posted by Reza Zadeh <re...@databricks.com>.

Here we go: https://issues.apache.org/jira/browse/SPARK-4823

On Wed, Dec 10, 2014 at 9:01 PM, Debasish Das <de...@gmail.com>
wrote:

> I added code to compute the topK products for each user and the topK users
> for each product in SPARK-3066.
>
> That is different from a row similarity calculation, as we need both user
> and product factors to calculate the topK recommendations.
>
> For (1) and (2) we are trying to find similarUsers to a given user and
> similarProducts to a given product.
>
> similarProducts to a given product is straightforward to compute through
> columnSimilarities/DIMSUM when the product dimension is skinny.
>
> similarUsers to a given user will need a map-reduce implementation of row
> similarity, since the matrix is tall.
>
> I don't see a JIRA for that yet. Are there any good references for a
> map-reduce implementation of row similarity?

Re: Row Similarity

Posted by Debasish Das <de...@gmail.com>.

I added code to compute the topK products for each user and the topK users
for each product in SPARK-3066.

That is different from a row similarity calculation, as we need both user
and product factors to calculate the topK recommendations.

For (1) and (2) we are trying to find similarUsers to a given user and
similarProducts to a given product.
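
The distinction can be sketched in a few lines of plain Python. This is only an illustration, assuming dot-product scoring on hypothetical factor dictionaries; it is not the code attached to SPARK-3066. Recommendation scores a user against the product factors, while similarUsers compares rows within the user factors alone.

```python
import heapq


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def top_k_products_for_user(user_vec, product_factors, k):
    """Recommendation: needs the user's factors AND all product factors."""
    scores = [(dot(user_vec, p), pid) for pid, p in product_factors.items()]
    return [pid for _, pid in heapq.nlargest(k, scores)]


def top_k_similar_users(user_id, user_factors, k):
    """Row similarity: needs only the user factors."""
    target = user_factors[user_id]
    scores = [(dot(target, u), uid)
              for uid, u in user_factors.items() if uid != user_id]
    return [uid for _, uid in heapq.nlargest(k, scores)]


# Hypothetical rank-2 factors from a matrix factorization model.
user_factors = {"u1": [1.0, 0.0], "u2": [0.8, 0.2], "u3": [0.0, 1.0]}
product_factors = {"p1": [0.9, 0.1], "p2": [0.1, 0.9]}

recommended = top_k_products_for_user(user_factors["u1"], product_factors, k=1)
similar = top_k_similar_users("u1", user_factors, k=1)
```

Swapping the dot product for cosine similarity would only change the scoring function, not which factor matrices each query touches.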

similarProducts to a given product is straightforward to compute through
columnSimilarities/DIMSUM when the product dimension is skinny.

similarUsers to a given user will need a map-reduce implementation of row
similarity, since the matrix is tall.

I don't see a JIRA for that yet. Are there any good references for a
map-reduce implementation of row similarity?

On Wed, Dec 10, 2014 at 2:30 PM, Reza Zadeh <re...@databricks.com> wrote:

> It's not so cheap to compute row similarities when there are many rows, as
> it amounts to computing the outer product of a matrix A (i.e. computing
> AA^T, which is expensive).
>
> There is a JIRA to track handling (1) and (2) more efficiently than
> computing all pairs: https://issues.apache.org/jira/browse/SPARK-3066

Re: Row Similarity

Posted by Reza Zadeh <re...@databricks.com>.

It's not so cheap to compute row similarities when there are many rows, as
it amounts to computing the outer product of a matrix A (i.e. computing
AA^T, which is expensive).
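
A quick back-of-the-envelope count shows the gap. The matrix dimensions below are hypothetical, chosen only to represent a tall, skinny matrix; the point is that all-pairs row similarity grows with the square of the number of rows, while all-pairs column similarity grows with the square of the number of columns.

```python
def pair_count(n):
    """Number of unordered pairs among n items: n * (n - 1) / 2."""
    return n * (n - 1) // 2


# Hypothetical tall, skinny matrix: 10 million rows, 1,000 columns.
num_rows, num_cols = 10_000_000, 1_000

row_pairs = pair_count(num_rows)  # entries above the diagonal of AA^T
col_pairs = pair_count(num_cols)  # entries above the diagonal of A^T A

print(f"row pairs:    {row_pairs:,}")
print(f"column pairs: {col_pairs:,}")
```

With these sizes there are about 5 x 10^13 row pairs against roughly 5 x 10^5 column pairs.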

There is a JIRA to track handling (1) and (2) more efficiently than
computing all pairs: https://issues.apache.org/jira/browse/SPARK-3066


