Posted to user@spark.apache.org by Rex X <dn...@gmail.com> on 2016/05/17 16:24:11 UTC

What's the best way to find the Nearest Neighbor row of a matrix with 10billion rows x 300 columns?

Each row of the given matrix is a Vector[Double]. I want to find the
nearest neighbor row of each row using cosine similarity.

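The problem here is the complexity: a brute-force comparison of all row
pairs is O(n^2), which is on the order of 10^20 similarity computations
for n = 10^10 rows.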

We need to do *blocking*, and do the row-wise comparison within each block.
Any tips for best practice?

In Spark, we have RowMatrix.columnSimilarities, but I didn't find a
*RowSimilarity* method.


Thank you.


Regards
Rex

Re: What's the best way to find the Nearest Neighbor row of a matrix with 10billion rows x 300 columns?

Posted by nguyen duc tuan <ne...@gmail.com>.

There's no *RowSimilarity* method in the RowMatrix class. You have to
transpose your matrix and use columnSimilarities instead. However, when the
number of rows is large, this approach is still very slow.
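
To illustrate why transposing works, here is a minimal plain-Python sketch
(not Spark code). The function names are hypothetical and chosen only for
this example. It shows that the column similarities of the transposed matrix
are the row similarities of the original, and that the work still grows
quadratically with the number of rows.

```python
import math


def transpose(matrix):
    """Swap rows and columns of a list-of-lists matrix."""
    return [list(col) for col in zip(*matrix)]


def column_cosine_similarities(matrix):
    """Cosine similarity for every pair of columns (i < j), brute force."""
    cols = transpose(matrix)
    norms = [math.sqrt(sum(x * x for x in c)) for c in cols]
    sims = {}
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            dot = sum(a * b for a, b in zip(cols[i], cols[j]))
            sims[(i, j)] = dot / (norms[i] * norms[j])
    return sims


def row_cosine_similarities(matrix):
    """Row similarities, obtained as column similarities of the transpose."""
    # Columns of the transpose are exactly the rows of the original matrix.
    return column_cosine_similarities(transpose(matrix))
```

The nested loop visits every pair once, so n rows need n(n-1)/2 dot products.
That is why this route does not scale to billions of rows.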
Try approximate nearest neighbor (ANN) methods such as LSH instead.
There are several implementations of LSH on Spark that you can find on GitHub.
For example: https://github.com/karlhigley/spark-neighbors.
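
As a rough illustration of the idea, here is a minimal single-machine sketch
of random-hyperplane (sign random projection) LSH for cosine similarity in
plain Python. It is not the API of spark-neighbors or of any other library.
All function names and the num_tables, num_bits and seed parameters are
assumptions made for this example. Rows are hashed into buckets, and exact
cosine similarity is computed only between rows that share a bucket, which is
the "blocking" asked about above.

```python
import math
import random
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def make_hyperplanes(num_bits, dim, rng):
    """Draw num_bits random Gaussian hyperplane normals of dimension dim."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def signature(vec, hyperplanes):
    """One bit per hyperplane: the side of the plane the vector falls on."""
    return tuple(
        1 if sum(h * x for h, x in zip(plane, vec)) >= 0.0 else 0
        for plane in hyperplanes
    )


def lsh_nearest_neighbors(rows, num_tables=8, num_bits=6, seed=0):
    """Approximate nearest neighbor (by cosine) for each row index.

    Rows that never share a bucket with another row map to None.
    """
    rng = random.Random(seed)
    dim = len(rows[0])
    candidates = defaultdict(set)
    for _ in range(num_tables):
        planes = make_hyperplanes(num_bits, dim, rng)
        buckets = defaultdict(list)
        for i, row in enumerate(rows):
            buckets[signature(row, planes)].append(i)
        # Blocking: only rows in the same bucket become candidate pairs.
        for members in buckets.values():
            for i in members:
                candidates[i].update(j for j in members if j != i)

    result = {}
    for i, row in enumerate(rows):
        best, best_sim = None, -2.0
        # Exact cosine similarity, but only against the candidate set.
        for j in candidates[i]:
            sim = cosine(row, rows[j])
            if sim > best_sim:
                best, best_sim = j, sim
        result[i] = best
    return result
```

Calling lsh_nearest_neighbors(rows) on a list of equal-length float lists
returns a dict from row index to the index of its approximate nearest
neighbor. More tables raise recall at the cost of more candidate pairs. More
bits per signature make buckets smaller and comparisons cheaper, but increase
the chance of missing the true neighbor.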

Another option is to use ANN libraries on a single machine. There's a
good benchmark of ANN libraries here:
https://github.com/erikbern/ann-benchmarks
